Preface


This book is intended to serve as a comprehensive text for the use of 
students in tests and measurements courses, whether in the field of psy- 
chology or education. It can also serve as a helpful adjunct to graduate 
courses in measurement, and as a practical guide to psychological testing 
for teachers and administrators. For use in undergraduate courses, the 
material has been organized in such a way that some of the more difficult 
sections can easily be omitted if necessary. For example, if his students
have had little or no training in statistics, the instructor may wish to omit 
all or parts of Chapters 6, 7, 8, and 9. 

Tests and Measurements: Assessment and Prediction is based on four 
principles. The first of these principles is that the student must begin by 
learning some statistical concepts and procedures before he can thor- 
oughly understand the measurement of human behavior. One of the major 
aims of this book is to make the statistical concepts relevant to such meas- 
urement both understandable and interesting. Although most students 
tend to be afraid of statistics and reluctant to learn about them, these 
negative attitudes are probably traceable more to poor methods of instruc- 
tion than to the intrinsic nature of the subject matter. All too often statis- 
tical formulas are presented to classes without adequate preparation or 
explanation. In Tests and Measurements, the author has undertaken to
provide the reader with some insights into why statistical methods are 
necessary and how they work. 

The second principle upon which this book is based is that the reader 
should be given a broad acquaintance with the whole subject of measuring
human behavior, not just with the measurement of individual differences.
Therefore, measurement in various kinds of psychological
research studies is discussed in Chapter 1; an introduction to psycho- 
physics is given in Chapter 2; Chapter 17 specifically concerns measure- 
ment techniques for studies of personality and social psychology; and the 
ethics of performing research studies are discussed in Chapter 18. 
Although the major emphasis of the book is on studies of individual 
differences, these studies are presented in a context of general psychology; 
and as much material as possible has been introduced on measurement 
methods used in psychological experiments. 

The third principle upon which this book is based is that the methodol- 
ogy of measurement is the proper subject matter for a tests and measure- 
ments course. Certainly a sound presentation of methodology is of more 
value to the students than a detailed description of particular tests and 
measurement techniques. For one thing, there are far too many tests and 
measurement techniques available to permit the reader to gain an 
acquaintance with even a fraction of them. Then, too, detailed study of 
particular tests and measurement techniques is of rather limited value, 
since we now know how to construct better instruments than most of 
those that are currently in use. 

The fourth principle is that controversies about human measurement 
should be met head on rather than avoided. Under the guise of eclecticism,
textbooks often hide from the reader the fact that there are differing
points of view on many issues. If a person does not become acquainted 
with the major controversies in a field and does not learn about the con- 
ceptual and mathematical tools that are used in these controversies, he 
has at best only a shallow acquaintance with the subject matter.
Consequently, the author has chosen to emphasize critical analyses of tests and
of general approaches to measurement. It is hoped that these critical 
analyses will acquaint the reader with the major controversies in human 
measurement, will sharpen his ability to criticize measurement methods, 
and will help him make the decisions that he will face when applying 
his learning in actual testing and measurement situations. 

The author is deeply indebted to those persons who read and criticized 
portions of the manuscript. Particular thanks are given to Doctors Lloyd 
Humphreys, Ted Husek, Charles Osgood, George Suci, and to Mr. How- 
ard Bobren. The book owes a great deal to the patience and fine secre- 
tarial skill of Mrs. Freda Schell. 


Jum C. Nunnally, Jr. 
CHAPTER 1


The Foundations of Psychological Measurement 


Psychological measurement is a close and important concern to all of 
us. The student who reads these pages for a course assignment is very 
much interested in one type of psychological measurement: the method 
of grading used by the instructor. The person who aspires to higher edu- 
cation needs the best possible measurement of his capabilities before 
wagering his future. A successful marriage depends on the personalities 
of the partners, and it would be worth a great deal to a young couple to 
have their compatibility measured in advance. Many millions of dollars 
are spent in measuring the public reaction to political candidates, new 
brands of soap, and television programs. In these and many other ways, 
psychological measurement is a part of everyday living. 

In spite of the many needs for psychological measurement methods, 
few had been developed up to the turn of this century. It is surprising 
that men learned so much of the world around them without learning 
more about themselves. A technique was developed for measuring the 
circumference of the earth two thousand years before psychological tests 
came into general use. There are still many weak spots in the storehouse 
of psychological measurement methods. Important decisions must often 
be made about people on the basis of very crude “yardsticks.” The slow- 
ness with which psychological measurement has developed is due in part 
to the complexity of the individual human. Some mistaken ideas about 
the purpose and substance of psychology have also retarded the develop- 
ment of measurement methods. These will be discussed in the following 
sections. 

The Nature of Psychological Data. The development of psychologi- 
cal measurement and of psychology as a whole was delayed, in part, by 
some misunderstanding about the nature of the subject matter. The 
philosopher Kant once said that it would not be possible to have a science 
of psychology because the basic data could not be observed and meas- 
ured. Modern psychologists would agree in part with Kant, agree to the 
extent that only certain kinds of psychological phenomena are open to 
observation and measurement. 

Psychological science concerns itself with the study of human behavior,
with the actions, judgments, words, and preferences of individuals. The 
study of such tangible behavior complies with the requirements of scien- 
tific research. However, there are other phenomena that are spoken of as 
being “psychological” which cannot be the substance of direct scientific 
study. Some of these are sensations, feelings, and images. 

Throughout the history of philosophical thought there has been a 
tendency to divide psychological phenomena into those that are “physi- 
cal” and those that are “mental.” This division of phenomena is referred 
to as psychophysical dualism, the belief in separate mental and physical 
processes. Because of this unfortunate logical standpoint scholars have 
been drawn into many needless arguments about the “connections” 
between the mental and the physical. 

By adopting a set of simple rules, the seemingly difficult problem of 
psychophysical dualism can be circumvented. The purpose of scientific 
effort is to test statements about the world of events—all those events that 
can be seen, heard, touched, or otherwise experienced in common. An 
illustrative scientific statement is, “Yellow fever is a virus infection.” The 
phenomenon being studied might itself be impalpable, such as magnetism, 
atomic action, or the transfer of heat. But our knowledge of the phenom- 
enon must always come from publicly observable events: the changing 
direction of a compass needle, the action of a Geiger counter, or the
reading on a thermometer.

In order to accept a statement as either true or false, there must be a 
demonstration of evidence in the world of observable events. Only then 
can a statement be tested. The evidence must be such that it can be exam- 
ined by others. The procedures for obtaining the evidence should be 
sufficiently clear to allow independent investigators to gather new evi- 
dence in respect to the problem. These are some of the major rules that 
have been adopted by scientists, and the rules demonstrate their worth 
by the progress that science has made during the last several hun- 
dred years. 

Instead of trying to pass judgment on the whole field of psychology as 
to whether it is “mental” or “physical,” it is more logical to examine par- 
ticular kinds of statements in psychology to see whether they meet the 
requirements of scientific inquiry. In this way we will find many state- 
ments that are testable, as well as many statements that we will want to 
avoid because they are untestable. 

Some testable and untestable statements in psychology can be illus- 
trated if the reader will imagine that he has in his hand a large, red, 
juicy-looking apple. Take a bite of this make-believe apple. Does it taste 
good? If you had a friend there with you, would it taste “better” to him 
than to you? How much more pleasant would be his sensation—twice as
much, or three times as much? This is an issue that cannot be studied
directly. There is no way to measure your sensations directly. They are 
uniquely yours in a way that exempts them from direct observation by 
others. Lift the apple in your hand. Do you have a feeling of “weighti- 
ness” in response to the apple? How “large” is your feeling? This is another 
phenomenon that cannot be observed by others or measured directly. 

There are some issues that could be studied in respect to the apple. 
Which would you rather have—an apple, an orange, or a pear? The way 
in which you respond to the question will demonstrate your preferences, 
and your preferences can be compared with those of other persons. How 
much do you think the apple weighs—7, 8, or 9 ounces? Your judgment 
can be compared with the weight of the apple, and a measure of your 
accuracy can be obtained. 

Your subjective experiences—your feelings, sensations, and desires— 
cannot be observed by others, and as such cannot be subjected to meas- 
urement. Once you do something in respect to your feelings—make a 
judgment, state a preference, or even talk to others about the experience— 
then these behaviors meet the requirements of scientific inquiry, and 
measurement becomes possible. 

It may seem like a trivial point to say that an individual's sensation of 
pleasantness is not open to scientific study but that his report of pleasant- 
ness is legitimate scientific data. However, failure to make this distinction 
forces the scientist into some knotty problems about the nature of his 
data. If it is claimed that the individual’s sensation of pleasantness is 
being measured, then one can always inquire as to the proof that the 
individual really enjoys the apple. The answer is, of course, that there is 
no way to prove that the individual enjoys the apple or, for that matter, 
to prove that there is any sensation at all related to the apple. We have 
only the individual's word for liking or not liking the apple, and the 
verbal report, in this instance, is the basic datum with which the scientist 
must deal. 

Being restricted to what the individual does, or says, does not fore- 
shadow defeat in psychological work. There is much that can be learned 
about human behavior, and there are many ways in which one human 
action can be predicted from others. For example, the individual's stated 
preferences for fruits would likely be predictive of what he would pur- 
chase or what he would choose to eat. More importantly, the scores that 
a person makes on psychological tests can predict approximately how well 
he will perform in many real-life situations. 

Emotional Barriers to Psychological Measurement. Psychological meas- 
urement efforts are often reacted to as being “nosey” activities—asking 
personal questions, embarrassing people with their shortcomings, and 
experimenting in emotionally laden matters. It gives most of us a twinge 
of anxiety to see ourselves portrayed as a set of statistics or to have our
abilities compared with those of others. We like to think of ourselves as 
being unique in a way that exempts us from study. 

Psychology has been held in disfavor by persons who think of it as a 
denial of “free will.” Their argument is that if a person's behavior is pre- 
dictable then the individual has no choice in his actions, no freedom of 
will. Psychology starts with no a priori viewpoints about the nature of 
man but looks instead for the regularities that occur in human behavior. 
When such regularities are sought, they are found in abundance. For 
example, a white object will nearly always be regarded as larger than a 
black one of the same size. Persons who make low scores on “intelligence” 
tests seldom manage to complete college training. If you are like most of 
us, you are consistent to some degree about the kinds of clothes that you 
buy, the types of people that you choose as friends, the place where you 
like to sit in a theater, and many other things that can be measured and 
studied. 

Some people think of psychology as an attempt to intrude on religious 
and ethical beliefs. Because of psychology’s behavioristic approach, it 
automatically prevents itself from investigating the individual's experi- 
ential world. Psychology can study the differences in ethical and moral 
practice among people but cannot reach decisions about which beliefs 
are more real or more correct. 

The Need for Measurement. Some form of measurement is essential to 
all scientific studies. Without the precision of mathematical language and 
the deductive possibilities which it allows, it would be very difficult to 
express most statements in a way that would allow them to be tested. 
Consider the hypothesis that tornadoes occur on days when the atmos- 
pheric pressure is high and a sudden drop in temperature occurs. We 
would first want to inquire what is “high” pressure, high in relation to 
what, how much in an absolute sense. This part of the statement could be 
rephrased in terms of barometer readings. The question of “how much”
would come again in respect to the phrase “sudden drop in temperature.” 
How much of a drop is a sudden drop—10 degrees over a period of 
30 minutes? We would need the measurement techniques provided by 
clocks and thermometers to make the original statement more specific.
You can see that a hypothesis as simple as the fictitious one here requires 
several kinds of measurements. The problem would be all but impossible 
to state precisely without reference to measurement methods. 
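
The way measurement turns a loose hypothesis into a testable one can be
sketched in a few lines of Python. The thresholds below (30.2 inches of
mercury as "high" pressure, a 10-degree drop within 30 minutes as "sudden")
and the function name are illustrative assumptions only; the text fixes no
such values, and the hypothesis itself is fictitious.

```python
# Illustrative sketch: the fictitious tornado hypothesis restated in
# measurable terms. Every threshold here is an assumption made for the
# example, not a meteorological fact.

HIGH_PRESSURE_INHG = 30.2    # assumed cutoff for a "high" barometer reading
SUDDEN_DROP_DEGREES = 10.0   # assumed size of a "sudden" temperature drop
SUDDEN_DROP_MINUTES = 30.0   # assumed time window for that drop


def conditions_met(pressure_inhg, temp_start, temp_end, minutes_elapsed):
    """Return True when both measured conditions of the hypothesis hold."""
    high_pressure = pressure_inhg >= HIGH_PRESSURE_INHG
    drop = temp_start - temp_end
    sudden_drop = (drop >= SUDDEN_DROP_DEGREES
                   and minutes_elapsed <= SUDDEN_DROP_MINUTES)
    return high_pressure and sudden_drop


# High barometer reading and a 12-degree drop in 25 minutes.
print(conditions_met(30.4, 78.0, 66.0, 25.0))
# High barometer reading, but only a 4-degree drop.
print(conditions_met(30.4, 78.0, 74.0, 25.0))
```

Once the terms are tied to barometer, thermometer, and clock readings in
this way, each day's record either satisfies the stated conditions or it
does not, and the hypothesis can be checked against observed tornadoes.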


MEASUREMENT IN PSYCHOLOGY 


There are a number of special fields in psychology. Some of these are 
clinical, social, physiological, and differential psychology. Research in all 
areas of psychology is dependent on adequate measurement methods. In
some areas, the measurement problems have proved more difficult than in 
others. For example, the clinical psychologist finds it difficult to measure 
things like the improvement of patients in psychotherapy and the severity 
of emotional disorders. One of the most difficult problems for psycholo- 
gists is to find ways to measure important human attributes. 

Individual Differences and Psychological Processes. There are two 
basically different kinds of measurement problems encountered in psy- 
chology. These are the measurement of individual differences and the 
measurement of processes. The study of individual differences, differential 
psychology, is sufficiently broad to be considered a special field. As the 
name implies, differential psychology concerns the ways in which people 
differ from one another. People differ in terms of their walking speeds, 
ability to learn Latin, sensitivity to pain, susceptibility to hypnosis, and in 
countless other ways. Some of these individual differences are of little 
importance in our daily lives; others are of paramount importance for 
success and happiness. 

A psychological process is a series of responses by one person. The 
process of reacting to alcohol is a useful example. First there is a decrease
in pulse rate, an increased feeling of pleasantness, an increase in reaction 
time, and finally, gross loss of coordination. The characteristic changes 
would be the same for all persons regardless of the initial levels of pulse 
rate, reaction time, and so on. 

Another example of the difference between processes and individual 
differences can be illustrated with a learning experiment. Any person 
who performs learning experiments with animals, say, with white rats, 
finds that some rats learn faster than others. Instead of being interested in 
individual differences among rats, the experimenter is usually more inter- 
ested in the learning process. That is, he would like to determine the 
regularity of improvement and the variables which affect improvement, 
regardless of individual differences. The family of learning curves in 
Figure 1-1 illustrates the problem. 

The curves in Figure 1-1 show the amounts of time taken by four rats 
to reach the goal in a maze on each of eleven trials. On the first trial it is 
apparent that there are individual differences among the rats in the time 
taken to reach the goal. Rat No. 1 took eighty seconds, and rat No. 4 took 
only fifty-five seconds. Also, we see that the order of the rats on the first 
trial is maintained in most of the other trials, showing that some of the 
rats are more capable in the particular type of learning problem. 

A second type of information to be seen in Figure 1-1, and generally 
the most sought-after information in studies of this kind, is that the 
process of learning tends to be the same for all four rats. Regardless of 
individual differences at each trial, the rats as a group tend to learn in the
same pattern. This group tendency can be seen better by looking at
the curve of average times. This was obtained by adding together the 
times for the four rats at each trial and dividing by four. This curve shows 
that for the rats as a group the learning is much faster during the first 
several trials. The rate of learning for the group levels off after that, and 
only slight improvement for the group can be found after the eighth trial.
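
The averaging just described can be sketched as follows. The running
times are invented for illustration; only the first-trial values for rat
No. 1 (eighty seconds) and rat No. 4 (fifty-five seconds) come from the
text, and the remaining numbers are assumptions chosen to resemble a
typical learning curve.

```python
# Hypothetical running times in seconds for four rats over eleven trials.
# Only the first-trial times of rat 1 (80) and rat 4 (55) are from the text.
times = {
    "rat 1": [80, 62, 50, 42, 36, 32, 29, 27, 26, 25, 25],
    "rat 2": [72, 56, 45, 38, 33, 29, 26, 25, 24, 23, 23],
    "rat 3": [64, 50, 40, 33, 28, 25, 23, 22, 21, 20, 20],
    "rat 4": [55, 42, 34, 28, 24, 21, 19, 18, 17, 17, 16],
}

n_trials = 11

# Average curve: add the times of all rats at each trial, divide by the
# number of rats.
average_times = [
    sum(rat_times[trial] for rat_times in times.values()) / len(times)
    for trial in range(n_trials)
]

# Improvement from one trial to the next, read off the average curve.
improvements = [
    average_times[trial] - average_times[trial + 1]
    for trial in range(n_trials - 1)
]

for trial, seconds in enumerate(average_times, start=1):
    print(f"trial {trial:2d}: average {seconds:.2f} s")
```

With these made-up numbers the early entries of `improvements` are much
larger than the late ones, which is the leveling-off that the average
curve is meant to reveal.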


Figure 1-1. Hypothetical curves showing the amounts of time taken by four rats to
reach the goal in a learning maze. 

In studies of individual differences, the experimenter usually hopes that
differences among people will be as large as possible. If such differences 
are small, there is nothing with respect to individual differences to study. 
In studies of processes, the experimenter usually hopes that individual 
differences will be as small as possible, because they only serve to
complicate the regularity that he is seeking to measure.

Measurement in Differential Psychology. Measurement methods are 
particularly essential in differential psychology. Differential psychology 
concerns two kinds of problems: (1) the logic and methods of measuring 
individual differences, and (2) the study of how individual differences 
arise and how they relate to one another. The first concerns the tools for 
measuring individual differences; the second concerns the facts that have
been established by applying the tools. 

The remainder of the book will be more concerned with the subject 
matter of differential psychology than any other special area, and the 
emphasis will be on the measurement of individual differences. Attention 
will be given to some of the problems in the measurement of processes. 

Although the major emphasis in this book is on the study of individual 
differences, we should not overemphasize the conclusions that can be 
obtained from studies of this kind. There are limits to what can be learned 
about human behavior from individual differences alone. Although studies 
of individual differences have proved very useful, it should be remem- 
bered that it is the study of processes with which psychology is more 
extensively concerned. 

The study of individual differences has a wide appeal in the United 
States. Not only has much of the work in this field of study been done by 
persons in our country, but the subject of individual differences has a 
particular appeal for the American public. There is perhaps no other 
country in which a person can attract so much attention by being differ- 
ent or doing something unusual. There are contests to see who can eat the 
most pies, sit longest on a flag pole, and dance longest without falling 
from exhaustion. 

One of the major reasons we are interested in differences among people 
is that this is a “land of opportunity,” as the well-worn phrase puts it. 
There is more opportunity for an individual to use his ability to attain
fame and fortune than in countries where either the class structure or the 
lower standard of living provides less room for the individual to get 
ahead. If a person is better looking, can sing better, is more intelligent, is 
a better athlete, he may be on the road to great personal gain. Conse- 
quently, people are usually eager to find some way in which they surpass 
others, even if it is in so remote a behavior as being able to hold their 
breath longer. 

As evidence of our interest in individual differences, American maga- 
zines are replete with short, and usually inadequate, tests of vocabulary, 
popularity, marriage happiness, and many others. The mother is eager to 
see that little Johnny learns to walk before the neighbor’s child. She wants 
to have his IQ measured, and hopes that he is a budding genius. 

The study of individual differences has been given considerable atten- 
tion during the last fifty years. One reason for the interest of psychologists 
in this type of work is the success that the testing movement has encoun- 
tered. Some areas of psychology can be more properly regarded as sci- 
ences of the future. The groundwork is being laid, but the present store- 
house of facts is not large. There is also a great deal yet to be learned 
about the measurement of individual differences. The measurement of 
personality characteristics has proven a particularly onerous stumbling
block. Even considering the uncharted regions in the measurement of 
human behavior, the testing movement in America has reached an ad- 
vanced stage of technical sophistication and proved practical importance. 


MEASUREMENT SCALES 


In the discussion to follow, we will speak of a “scale” as denoting the 
procedure, or “yardstick,” with which a particular measure is obtained. 
Measurement scales are part of everyday living: the ruler as a scale to 
determine length, the thermometer to indicate temperature, and the 
achievement test to measure success in school. We customarily place 
numbers along scales, such as inches, degrees Fahrenheit, and test scores. 
However, these are not the same kind of numbers; they mean different 
things about “how much,” and they have different mathematical uses. 
Some of the major kinds of measurement scales will be discussed in the 
following sections. 

Labels. Labels, or tags, are used to signify that one object or person is 
distinct from another. The numbers on the backs of baseball players are 
labels. They tell nothing at all about quantities. The player numbered 44 
is not necessarily twice as good as 22. If we add the labels for these two 
players, 66 is obtained. Does this sum have any meaning? Any other num- 
bers would have served as well, or the players could be distinguished by 
different colors, alphabetical letters, or geometric designs. Labels have no 
mathematical properties and are not scales at all. Other examples of num- 
bers used as labels are the numbers on theater seats, numbers used to 
designate highways, and the numbering of stamps in a collection. Not 
every set of numbers is necessarily a set of measures. The investigator 
must ensure that the numbers which he studies are really quantities rather 
than mere labels. 

Categories. Categorization, or qualitative distinction, is the most ele- 
mentary type of scientific datum. The biologist makes a distinction between 
insects that have wings and those that do not. The chemist classifies 
liquids into those that are acid and those that are alkaline. Mental patients 
are categorized into groups such as depressive and schizophrenic. Mem- 
bers of the general public are categorized in terms of their preferences for 
political candidates.

Categorization implies that the things in a category have something in 
common and that they differ in some characteristic from the objects in 
other categories. The difference can be illustrated with a geologist’s
collection of rocks. The geologist might first apply labels to his collection,
numbering them from 1 to N, simply as a way of keeping track of all the 
specimens. Later he might categorize the rocks as to whether they are
sedimentary or igneous in origin. 

It was said previously that no mathematical significance could be
placed on labels. Categories are different: although categorization is a
humble beginning, there are complex mathematical systems that work
entirely with categorical data.

The number of categories used in particular problems may be only two, 
or, in some studies, the number of categories is large. An example of 
multiple categorization would be the description of people in terms of 
occupations: farmers, plumbers, lawyers, and so on. 

When using categories, no distinction is made among the objects or 
persons within a category. We might think of the members of any cate- 
gory as being equated to the number one—one plumber, one schizo- 
phrenic, or one alkaline liquid. This would then give no information as to 
whether one plumber is better than another or whether one liquid is more 
alkaline than another. Also, from the categorizations alone, there would 
be no way of knowing that lawyers receive higher incomes than farmers 
and plumbers, or that any relationship exists among the categories. 

The results of psychological studies are sometimes reported in terms of 
categories. There are a number of ways of statistically analyzing data of 
this kind, some of which will be mentioned in the chapters ahead. Chief 
among these procedures is the description of populations of people in 
terms of the frequencies that fall in various categories and the comparison 
of categories for the amount of overlap. Illustrating the former, mental 
hospital patients can be characterized as schizophrenic, paranoid, manic- 
depressive, or other. The latter type of procedure might be used to show 
that the members of one category of mental patients more frequently 
come from home environments of a particular kind than do the members 
of other categories of mental patients. 
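
Both procedures amount to counting. A minimal sketch, using invented
patient records and an assumed "urban" versus "rural" home environment
(the text names no particular environment), might look like this:

```python
from collections import Counter

# Hypothetical records: (diagnostic category, home environment).
# The data and the urban/rural distinction are invented for illustration.
patients = [
    ("schizophrenic", "urban"),
    ("schizophrenic", "urban"),
    ("schizophrenic", "rural"),
    ("paranoid", "rural"),
    ("paranoid", "urban"),
    ("manic-depressive", "rural"),
    ("manic-depressive", "rural"),
    ("other", "urban"),
]

# Frequencies per category: each member simply counts as "one".
category_counts = Counter(category for category, _ in patients)

# Cross-classification: how often each category occurs together with
# each home environment.
joint_counts = Counter(patients)

for category, count in category_counts.items():
    urban = joint_counts[(category, "urban")]
    print(f"{category}: {count} patients, {urban} from urban homes")
```

Nothing in the counts ranks one category above another; the only
arithmetic involved is tallying how many cases fall in each cell.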

Ordinal Scales. The ordinal scale relates persons or objects to one 
another by ordering or ranking them in respect to an attribute. Course 
marks are sometimes reported as ranks. The person who performs best is 
given a rank of 1, second-best receives a rank of 2, and so on to the person 
with the worst performance. As a convention the ordering is symbolized 
as 1, 2, 3, ..., N, with the understanding that the numbers mean first,
second, third, ..., Nth.

The essence of the ordinal scale is the concept of “greater than” por- 
trayed by the symbol (>). Thus a > b means that a is greater than b, or 
that b is less than a. To have an ordinal scale, it must be established that 
a>b>c>d>e>...>N for N number of persons in respect to 
any particular attribute. The concept of “greater than” characterizing 
ordinal scales is a kind of information in addition to the concept of “dif- 
ferent from” used with categories. 

An ordering, or rank-ordering, constitutes a scale of measurement, but
the use of such scales has definite mathematical limitations. The ordinary 
arithmetical operations of addition, subtraction, multiplication, and divi- 
sion make little sense with ordinal numbers. It is not sensible to add 
together the first and third man and equate this in any way with the 
fourth man, as the ordinary operations of arithmetic would lead us to 
expect. 

Two important kinds of information are lacking in the ordinal scale. 
The first is that no information is provided as to how well the group per- 
forms as a whole. If there are six students and no ties in examination 
grades, there will be ranks 1, 2, ..., 6 regardless of how well or how
poorly the whole group performs. The ranks might have been determined 
within a group of geniuses, or the ranks might as easily have been 
obtained within a group of only moderately capable students. 

The second important information that is lacking in the ordinal scale is 
that of the dispersion of performance. That is, there is no way of telling 
from the ranks alone how closely the second man approaches the first 
man, or how much better the first man performs than the man who is 
ranked last. If there are two classes and thirty students in each class, 
there will be ranks 1 through 30 in each class. There is no way of telling 
from the ranks whether the dispersion of ability is larger in one class than 
in the other. The students in one class might have performed much the 
same, and the students in the other class might have varied widely from 
one another in performance. 
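
Both limitations can be seen in a short sketch. The two sets of
examination scores are hypothetical, and the helper `ranks` is written
for this example on the assumption that there are no ties.

```python
def ranks(scores):
    """Rank scores so that the highest score receives rank 1.

    Assumes no tied scores, as in the six-student example in the text.
    """
    ordered = sorted(scores, reverse=True)
    return [ordered.index(score) + 1 for score in scores]


# Hypothetical examination scores for two classes of six students.
class_a = [98, 95, 91, 90, 88, 85]  # uniformly strong, tightly bunched
class_b = [80, 71, 60, 52, 43, 30]  # weaker overall and widely spread

print(ranks(class_a))
print(ranks(class_b))
```

The two classes yield identical ranks, so neither the general level of
performance nor its dispersion can be recovered from the ranks alone.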

The most important type of mathematical operation that can be applied 
to ranks is that of correlation. In this way, it can be determined to what 
extent persons are ordered alike in two circumstances. It could be
determined whether the persons who make grades near the top in history also
make high grades in mathematics; additionally, an over-all numerical
index of agreement could be found over the extent of both orderings. The
ability to correlate two sets of measures satisfies the most basic require- 
ment for evaluating tests. Correlational analysis will be discussed in 
detail in Chapter 5. 
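
As a preview, one widely used index of agreement between two rankings is
Spearman's rank-order coefficient. It is shown here only as an
illustration and is not necessarily the procedure developed in Chapter 5.
The history and mathematics ranks are hypothetical, and the formula used
assumes no tied ranks.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman rank-order correlation for two rankings without ties.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the
    difference between a person's two ranks and n is the number of persons.
    """
    n = len(ranks_x)
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks of six students in two subjects.
history_ranks = [1, 2, 3, 4, 5, 6]
mathematics_ranks = [2, 1, 3, 5, 4, 6]

rho = spearman_rho(history_ranks, mathematics_ranks)
print(round(rho, 3))
```

A value of 1 would mean that the two orderings are identical, a value
of -1 that one is the exact reverse of the other, and values near zero
that standing in one subject tells little about standing in the other.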

Interval Scales. If, in addition to an ordering of persons, a knowledge 
is obtained of the interval or the distance between persons, then an inter- 
val scale is produced. The results of a foot race could be reported in the 
form of an interval scale. The customary procedure is, of course, to state 
the exact running time for each participant. To illustrate the character- 
istics of interval scales, assume that the results are reported in terms of the 
interval between the first and second man, the second and third man, and 
so on. It could be said that the second runner came in one second behind 
the first, the third runner came in two seconds behind the second man, 
the fourth came in one second behind the third, and so on for all intervals. 

Like ordinal scales, interval scales also provide no information about
how well the people perform as a group. Because the results of the race 
were reported only in terms of intervals, we have no information about 
the absolute running times for the men. If we learn later that the first 
runner took ten seconds, then we know that the second man took exactly 
eleven seconds, the third man thirteen seconds, and so on for the other 
runners; but if the first man had taken twenty seconds, then we learn that 
the second man took twenty-one seconds, indicating that the group as a 
whole is much slower. 

The important advantage of the interval scale over the ordinal scale is 
that knowing the interval allows for the determination of the dispersion, 
or spread, of the scores. The range of scores is the interval, or distance, 
between the person who stands highest and the person who stands lowest 
on the scale. Also, the standard deviation of a set of scores can be deter- 
mined without a knowledge of the absolute level of measurements. 
(Measures of dispersion will be discussed in Chapter 3.) 
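
The point about dispersion can be checked with the foot-race example. The sketch below rebuilds the running times from the reported intervals under two assumed times for the first runner (ten and twenty seconds, as in the text) and shows that the range and the standard deviation come out the same in both cases. The helper name and the use of the population standard deviation are choices made for this illustration only.

```python
from statistics import pstdev


def times_from_intervals(first_time, intervals):
    """Rebuild running times from the first runner's time and successive gaps."""
    times = [first_time]
    for gap in intervals:
        times.append(times[-1] + gap)
    return times


# Intervals from the race example: 1 second, then 2 seconds, then 1 second.
intervals = [1, 2, 1]

fast_group = times_from_intervals(10, intervals)  # first runner took ten seconds
slow_group = times_from_intervals(20, intervals)  # first runner took twenty seconds

for label, times in (("fast", fast_group), ("slow", slow_group)):
    spread = max(times) - min(times)
    print(label, times, "range:", spread, "standard deviation:", round(pstdev(times), 3))
```

The absolute level of the scores differs between the two groups, but both measures of dispersion depend only on the intervals.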

The arithmetical operations of addition, subtraction, multiplication, 
and division can be used with interval scales only in respect to the differ- 
ences between scores. For example, the interval between the men who 
finish the race first and second could be compared directly with the inter- 
val between those who finish third and fourth. 

The ordinary Fahrenheit thermometer is an example of an interval 
scale. The zero point on the scale is arbitrary. It is not meaningful, for 
example, to say that 90 degrees is twice as warm as 45 degrees. 

Ratio Scales. The final member in the hierarchy of measurement scales 
is the ratio scale. The ratio scale requires that the absolute measurement 
of each person be known. Then in the race that we used to illustrate 
interval scales, we might find that the first man took eleven seconds, the 
second man twelve seconds, the third man fourteen seconds, and so on 
for the remaining runners. Specifying the absolute level for running, or 
for any other attribute, is the same as saying that the zero point on the 
scale is known. 

In addition to its own special virtues, the ratio scale has all the advan- 
tages of the less powerful scales. All of the operations of arithmetic as 
well as the tools of higher mathematics can be applied to ratio scales. As 
the name implies, one score can be divided by another, multiplied by 
another, subtracted from and added to any other. 

Ratio scales are used quite frequently in everyday life. No simpler 
example is available than the weighing of two objects on a “scale.” Com- 
parisons of salaries, heights of individuals, numbers of rooms in buildings, 
and many, many other quantities are treated as ratio scales. It is so cus- 
tomary to deal with ratio scales that it is easy to make the mistake of 
assuming that all measures are of this kind. For example, it might be said 
that John is “twice” as handsome as Bill. It is difficult in this case to see 
how the ratio “twice” could be obtained, and it is reasonable to suspect 
that a ratio scale is being used where only an interval or even an ordinal 
scale is justified. 

Test Scores as Scales. Most test scores are treated as though they are 
interval scales. Without such an assumption there would be no way of 
obtaining measures of dispersion. As will be shown later, the standard 
deviation and its companion statistic, the variance, are crucial to the 
gathering of test norms and to the determination of how well tests work. 

It is not justifiable to treat most test scores as though they constitute 
ratio scales. The absolute scores that people obtain are largely artifacts 
of the ways in which tests are constructed. For example, a class instructor 
might decide to give five points for each correct answer, or he might 
decide on ten points instead. Such decisions markedly affect the absolute 
size of scores. For example, suppose that persons a, b, and c make scores 
of 20, 15, and 12 on a test. Then, if we assume that a ratio scale is logical 
in this case, it could be said that a demonstrates 1.33 times as much of 
the particular ability as does b. Suppose that the test were revised by 
adding five items which are so easy that a, b, and c get them all correct. 
Now, the scores for a, b, and c would be 25, 20, and 17 respectively. This 
changes the ratio between the score of a and b to 1.25. We see in this way 
that ratios computed from test scores are not usually meaningful. 
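
The arithmetic of this example can be worked through in a few lines. The sketch below uses the scores given above for persons a, b, and c; the variable names are illustrative only. It shows that adding five easy items changes the ratio between a and b while leaving the difference between them untouched, which is why such scores can be treated as an interval scale but not as a ratio scale.

```python
# Scores from the example: persons a, b, and c on the original test.
original = {"a": 20, "b": 15, "c": 12}

# The revised test adds five easy items that everyone answers correctly.
revised = {person: score + 5 for person, score in original.items()}

ratio_before = original["a"] / original["b"]
ratio_after = revised["a"] / revised["b"]
difference_before = original["a"] - original["b"]
difference_after = revised["a"] - revised["b"]

print(f"ratio of a to b: {ratio_before:.2f} before, {ratio_after:.2f} after")
print(f"difference between a and b: {difference_before} before, {difference_after} after")
```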

In some instances, psychological measures are justifiably treated as 
ratio scales. Some of these would be the number of trials needed to learn 
a particular task, the length of time taken to respond to a visual signal, 
and the amount of perceptual distortion induced by a visual illusion. 
There have been some interesting attempts to deduce ratio scales for 
certain types of psychological measures, particularly for measures of atti- 
tudes and for measures of human judgment. 


SUMMARY 


Whether or not it is recognized and designated as such, psychological 
measurement is an important ingredient of everyday life. Systematic 
measurement methods were rather late in coming, and only during the 
last one hundred years has the problem been carefully studied. A major 
drawback to the development of measurement methods has been the 
failure to distinguish the kinds of psychological phenomena that can and 
cannot be measured. Psychological science is concerned with human 
behavior, with the actions, words, judgments, and preferences of people— 
all of which are open to measurement. Psychological science is not concerned
with purely subjective phenomena; until the individual does something
about his feelings, there is nothing to measure. 
We are so accustomed to measuring physical objects and assigning 
numbers to them on the basis of ratio scales that it is easy to assume that 
all measurements are of that kind. However, many measurements, and 
particularly those in psychology, must be made on a cruder basis. Conse- 
quently, it is important to specify the type of measurement scale which 
is in use. This will indicate the kinds of mathematical procedures that 
can be legitimately employed. 

Although we should be impressed with the need for measurement 
methods, this should not dim the importance of simple human observation 
and thought in the search for scientific lawfulness. Measurements are 
helpful to the scientist in explaining and exploring theories, but only the 
human observer can invent theories. No amount of elaborate measure- 
ment can make up for a lack of ideas on the part of the experimenter. 

CHAPTER 2 


The Growth of Measurement Methods in Psychology 


People have always been interested in the measurement of human 
attributes, and embryonic studies can be traced even to ancient China. 
However, it was only one hundred years ago that the first systematic 
attacks were made on problems of psychological measurement. During 
the nineteenth century, the field of psychological measurement was nour- 
ished by two major influences. First, psychological measurement bor- 
rowed heavily from the concepts and tools that had been applied so suc- 
cessfully in physics, chemistry, and astronomy. The nineteenth century 
was a period of great progress in the physical sciences, and it was rea- 
soned by men of that day that the methods of physical science could be 
applied to the human mind. It was during this period that psychophysics 
was born: the precise and quantitative study of how human judgments 
are made. Psychophysics has a proud tradition, and it has had a whole- 
some influence on measurement methods in all areas of psychology. 
Because of the importance of psychophysics in its own right and because 
of the impact that it has had on psychological testing, the development of 
psychophysics and some of the measurement methods will be discussed in 
this chapter. 

The second major influence on the development of psychological meas- 
urement methods was the clinical tradition growing out of medicine, psy- 
chiatry, and social welfare research. In these efforts there was an urgent 
need for methods of measuring emotional stability and intelligence. The 
first practical mental tests grew out of this tradition and were applied to 
the problem of classifying mental deficients in public schools. 

It has been only during the last several decades that these two tradi- 
tions have merged. The union of scientific technology and mathematics 
with the clinician’s intuitive approach and his concern for people pro- 
duced the modern methodology of testing. 

In addition to the two major influences described above, a number of 
other historical trends had a prominent influence on the development 
of psychological measurement methods. There are many needs for psychological
tests in education, and measurement methods have grown at
the same rate that educational facilities have grown. Because of the many 
direct uses for statistical methods in the development and use of psycho- 
logical tests, the early developments in the discipline of statistics were 
prominent influences on the growth of psychological measurement. 

The theory of evolution had a very important influence on psychology 
as a whole and particularly on the field of psychological measurement. 
Although evolution developed in biology and had its most important 
influence there, it helped foster new concepts of human behavior and 
interested psychologists in measuring human adjustment and the inheri- 
tance of psychological traits. 

More recently, the two great wars of this century had very important 
impacts on psychological measurement. In both conflicts psychological 
tests were needed for the selection and classification of armed forces per- 
sonnel. Never before had psychological tests been used in such wholesale 
quantities, and they proved their worth so well that psychological testing 
received wide public acceptance. 

Each of the major trends mentioned above will be discussed in some 
detail in this chapter. Only by understanding these historical roots can 
the modern methodology of psychological measurement be seen in its 
proper perspective. 


PSYCHOPHYSICS 


The Personal Equation. Considerable progress was being made dur- 
ing the eighteenth century in the development of scientific instruments: 
chronographs, telescopes, compasses, microscopes, and many others. Al- 
though scant attention was given to the place of the human observer in 
the use of these instruments, a fortunate mishap stirred interest in the 
problem. At the observatory at Greenwich, England, in 1796, an astron- 
omer’s assistant, named Kinneybrook, found himself consistently at odds 
with his superior about astronomical observations. It was Kinneybrook’s 
job to observe the time of transit for certain stars (this consisted, essen- 
tially, of noting the time that a star passed one of the cross wires on a 
telescope). It was found that Kinneybrook differed with his superior on 
the average by more than one-half second. Although such an amount of 
time might seem trivial for practical purposes, it was sufficient to intro- 
duce serious error into astronomical calculations. The astronomer encour- 
aged Kinneybrook to be more accurate in his observations; but in spite of 
his efforts, Kinneybrook retained his characteristic difference in observa- 
tion time (see 2, chap. 8). 

Astronomers as well as other scientists became interested in Kinney- 
brook’s difficulties. Astronomers began to compare their measurements 
with one another and found that there were consistent differences among 
them. It became necessary then to determine an individual's “personal 
equation,” his characteristic tendency to over- or underestimate observa- 
tions by a certain amount. This brought a recognition that people differ 
in their judgments, and that such individual differences can be measured 
and accounted for in scientific work. 

The Limen of Sensation. Soon after interest had been focused on the 
personal equation, philosophers and the early psychologists began to 
speculate about the threshold of awareness, the limen. The limen is the 
point at which a visual object, or stimulus of any kind, comes into aware- 
ness. If you sit at late afternoon looking at the sky and wait long enough, 
stars will become visible. One minute you do not see them and the next 
minute you do. It is interesting to speculate about the precise moment at 
which the stars come into awareness, the absolute limen. Another example
of the absolute limen can be made with the use of a watch. If you hold 
a watch at arm’s length, you will not be able to hear the “ticking” (unless 
you have an unusually noisy watch). As you draw the watch closer, there 
is a point at which you can hear the ticking, which is the absolute limen. 

In addition to the absolute limen, there is a difference limen, the differ- 
ence in stimulation which makes the individual aware of a difference. The 
difference limen can be illustrated with a hypothetical experiment. 
Assume that you have two light bulbs, which we will refer to as S_t and S_v.
The brightness of S_t is fixed at a particular level and is not changed 
throughout the experiment. Let the current be adjustable on S_v so that the 
intensity of light can be varied, and then make S_v the same brightness as 
S_t. If you increase the intensity of S_v, making it gradually brighter, there 
will be a point at which the subject will become aware of the difference 
between S_t and S_v. The difference in intensity of the two lights at that 
point is the difference limen, or as it is sometimes referred to, the JND, 
the just noticeable difference. Another kind of difference limen can be 
obtained by making the two lights markedly different in intensity at the
beginning of the experiment and making the variable light gradually dimmer
until the subject can no longer recognize a difference.

[...] noticeably different from a weight of 10 pounds? To carry the question 
further, would a weight of 101 pounds be “just noticeably different” from 
a weight of 100 pounds? Your everyday experience tells you that this is 
not the case, that the difference limen is dependent on the level of 
intensity of S_t. 

Weber found that the JND tends to be a constant ratio of the 
intensity of S_t. If, in terms of the example above, a weight of 6 pounds is 
just noticeably different from 5 pounds (the JND is 1 pound at this level 
of intensity for S_t), the ratio of the just noticeably different weight to
S_t would be 1.2, so that the JND is 0.2 of S_t. Then we would predict that 
the JND for an S_t of 10 pounds would be 2 pounds (a weight of 12 pounds 
would be just noticeably different). The JND can then be expressed in 
the form of an equation, in which c is a constant (0.2 in this example), as: 


JND = c S_t 


The above equation, which is referred to as Weber’s law, is approximately 
true for many different situations involving judgments. It has even been 
applied analogously to such situations as the willingness of an individual 
to invest money. For example, an individual might be willing to invest 
$100 in a radio for a $3000 car but would not be willing to spend that 
much on a radio for a $2000 car—the willingness to invest more money 
tending to be a function of the amount of money already invested. 
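
A minimal sketch of Weber's law applied to the lifted-weight example follows. It assumes that the constant is estimated from a single pair of weights (5 and 6 pounds) and then used to predict the JND at other levels; the function names are hypothetical, and real data would fit the law only approximately.

```python
def weber_constant(standard, just_different):
    """Estimate the constant from a standard and the weight just noticeably heavier."""
    return (just_different - standard) / standard


def predicted_jnd(constant, standard):
    """Weber's law: the JND is the constant times the intensity of the standard."""
    return constant * standard


# From the example: 6 pounds is just noticeably different from 5 pounds.
c = weber_constant(5, 6)

for standard in (5, 10, 100):
    jnd = predicted_jnd(c, standard)
    print(f"standard of {standard} pounds: predicted JND of {jnd:.1f} pounds")
```

Under this sketch a weight of 101 pounds would fall well inside the predicted JND for a 100-pound standard, in line with the everyday experience described above.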

Gustav Fechner. In the hands of Gustav Fechner, the rising interest in 
human judgment became the cornerstone for modern psychological 
measurement. Fechner went to the University of Leipzig when he was 
sixteen years old and remained there, studying, then teaching and doing 
research until his death seventy years later. Fechner went through an 
amazing gamut of careers. He took his university degree in medicine, but 
his subsequent livelihood and reputation were based on work in physics. 
He sought accomplishment as a philosopher; and, almost in spite of him- 
self, he became a great psychologist. Fechner was what we might refer to 
today as an “odd ball.” His philosophical speculations concerned the pres- 
ence of consciousness in all things, even imputing consciousness to plants 
and inanimate objects. He wrote numerous pamphlets under the pen 
name of Dr. Mises. In a heraldic tone the pamphlets called on the sleeping
world to arise to the new philosophy. 

Fechner tried to support his metaphysical beliefs with scientific experi- 
ments. He seized on the studies of human judgment, and particularly the 
JND, as the starting point for his work. His research was based on the 
postulate that sensation cannot be measured directly but that it is legiti- 
mate to ask an individual whether a sensation is present or not and 
whether one sensation is more intense than another. This is still the 
logical foundation of psychophysics. Fechner set out on a long program of 
research on human judgment, and in the course of this, he invented the 
major methods of measurement and many of the procedures of analysis 
which are used today. He performed the now classical studies on lifted 
weights, visual brightness, and the sense of touch. Although he never 
proved his philosophical points, he demonstrated how the logic and 
methods of science could be used in psychological measurement. We will 
look at some of the methods that were either developed by Fechner or 
were suggested by his work. 


PSYCHOPHYSICAL MEASUREMENT METHODS 


The Method of Limits. The method of limits was used on the previous 
pages to explain the difference limen. In using the method a test stimulus, 
S_t, is compared to a variable stimulus, S_v. One standard procedure is to 
place S_v equal to S_t, then increase S_v gradually until the subject recognizes 
the difference. Another procedure is to make S_v markedly more intense 
than S_t, and then gradually lower the intensity of S_v until the subject can 
no longer recognize a difference. 

The method of limits is useful in designing mechanical equipment 
which requires the operator to detect differences in stimulus intensity of 
any kind. A radar scope embodies this principle. The operator must 
detect differences in amount of light on the scope. The equipment should 
be designed in such a way that the operator can most easily detect distant 
objects, such as a formation of airplanes. 

The Method of Average Error. This psychophysical method differs from 
the method of limits in that the subject, instead of the experimenter, 
varies S_v; and the task is to equate two stimulus intensities rather than to 
judge when a difference is present. One procedure is to place S_v at a 
markedly different intensity from S_t. The subject is asked to adjust S_v 
until it appears to be the same as S_t. The classical use for the method of 
average error is in the study of visual illusions. Figure 2-1 shows the 
famous Müller-Lyer illusion. 

FIGURE 2-1. An apparatus for measuring the effect of the Müller-Lyer illusion. 

In Figure 2-1 the arrowhead in the middle, 2, is midway between 1 and 3; 
however, the distance between 1 and 2 appears greater than [...] 
to a cord and can be moved to the [...], the method of average error can 
be used to [...]. The arrowhead at 3 can be set far to [...] adjust it to a 
point where the two distances appear equal. The experimenter can then 
measure the difference between the two lines to determine the effect of 
the illusion. 

The method of average error is often used to test depth perception. In 
this case two sticks are located in a frame placed about 20 feet from the 
subject. One stick, S_t, cannot be moved by the subject. A pulley arrangement
permits the subject to move the other stick, S_v, forward and backward.
The task for the subject is to align the two sticks (make them be 
at equal distances from himself). A person with good depth perception 
can align the two sticks within 1 or 2 inches, whereas some persons cannot 
bring the sticks closer than a foot from one another. 

The Constant Method. Instead of using a variable stimulus, S_v, to compare
with an S_t, the constant method uses a fixed set of stimuli, S_1, S_2, S_3, 
and so on, to compare in turn with one S_t. In laboratory work the number 
of comparison stimuli ranges from four to more than eight. In an experiment
with lifted weights, six comparison stimuli could be used. If S_t is a 
weight of 200 grams, three of the comparison stimuli would weigh more 
than S_t and the other three would weigh less than S_t. The subject would 
be presented one of the comparison stimuli and asked to judge whether 
it is heavier or lighter than S_t. Next, the subject would be given another 
of the six comparison stimuli and asked to judge whether it is heavier or 
lighter than S_t. The subject lifts S_t with one hand while holding the
comparison stimulus in the other hand. Typically, the comparison stimuli 
would be given to the subject in a random order, to remove any unnecessary
cues for judgment. Also, the subject would alternately lift S_t with the 
right and left hand to remove this source of bias. After the subject had 
made many such comparisons, perhaps several hundred, the results could 
be pictured like that in Figure 2-2. 

FIGURE 2-2. Psychometric function for the application of the constant method to the 
study of lifted weights. 

Figure 2-2 shows, as you would expect, that the per cent of answers 
“greater” grows as weights above 200 grams are considered, and that the 
percentage declines as weights less than 200 grams are considered. The 
subject hardly ever says that a weight of 185 grams is greater than S_t and 
almost always says that a weight of 215 grams is greater than S_t. A curve 
drawn through these points on the graph shows the typical psycho- 
physical function. 

The constant method can be used in a wide variety of psychophysical 
studies. Not only is it a useful way of determining thresholds, but the 
shape of the psychophysical function with different kinds of judgments 
is of interest. 
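
The tabulation behind such a psychophysical function can be sketched as follows. The counts are hypothetical and were invented only to resemble the pattern described for lifted weights; the six comparison weights, the 100 trials per weight, and the use of straight-line interpolation to find the weight judged "greater" half of the time are all assumptions of this illustration, not procedures taken from the text.

```python
# Hypothetical counts of "greater" judgments out of 100 trials per comparison weight.
TRIALS_PER_WEIGHT = 100
greater_counts = {185: 4, 191: 15, 197: 38, 203: 64, 209: 88, 215: 97}


def proportion_greater(counts, trials):
    """Convert counts of "greater" judgments into proportions for each weight."""
    return {weight: count / trials for weight, count in counts.items()}


def fifty_percent_point(proportions):
    """Interpolate linearly to the weight judged "greater" half of the time."""
    weights = sorted(proportions)
    for low, high in zip(weights, weights[1:]):
        p_low, p_high = proportions[low], proportions[high]
        if p_low <= 0.5 <= p_high and p_high > p_low:
            fraction = (0.5 - p_low) / (p_high - p_low)
            return low + fraction * (high - low)
    return None  # the proportions never cross one half


proportions = proportion_greater(greater_counts, TRIALS_PER_WEIGHT)
crossover = fifty_percent_point(proportions)

for weight in sorted(proportions):
    print(f"{weight} grams: {proportions[weight]:.0%} judged greater")
print(f"weight judged greater half of the time: about {crossover:.1f} grams")
```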

The Method of Rank-order. In this method a number of stimuli are 
ordered in respect to an attribute. No one of the stimuli is specifically 
designated as the S_t. A simple example is to give an individual a number 
of objects and ask him to order them from heaviest to lightest. The 
method of rank-order is most frequently used when there are no meas- 
urable scales for the stimuli, as would be the case if an individual were 
asked to rank-order a number of foods in terms of preference. There 
would then be no intrinsic property of “likability” that could be measured, 
and consequently, the problem would be to determine the preference 
ordering for the one person. There would be no way of determining how 
“accurate” the individual is, as would be possible if lifted weights or 
visual intensities were being studied. 

The method of rank-order is used in many studies of interests, attitudes, 
and preferences. The subject is asked to rank-order a number of activities 
in terms of his interests, a number of acquaintances in terms of preference,
[...] articles in terms of what he would purchase. The 
[...] of human ability, in which the individual is 
[...] number of alternative answers to a question. Also, 
[...] reported as ranks, the person making the highest 
grade receiving a rank of 1, the second highest a rank of 2, and so on for 
[...] 

[...] peaches, and so on until apples are compared with each of the other 
fruits. There is a resemblance to the constant method in that apples are 
compared to a number of other stimuli. However, in pair comparisons, 
the subject is then asked to say whether he prefers oranges to peaches, 
and he continues to compare oranges with each of the other fruits. In all, 
the subject must make n(n - 1)/2 judgments, where n is the number of 
stimuli being studied. If there are 10 fruits in the example, there will be 
45 comparisons to be made. 
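
The count of judgments can be verified by listing the pairs directly. In the sketch below only apples, oranges, and peaches come from the example; the other seven fruit names are hypothetical fillers added to bring the list to ten.

```python
from itertools import combinations


def number_of_pairs(n):
    """Number of judgments needed to compare each of n stimuli with every other one."""
    return n * (n - 1) // 2


# Apples, oranges, and peaches appear in the example; the rest are illustrative.
fruits = ["apples", "oranges", "peaches", "pears", "plums",
          "grapes", "cherries", "bananas", "figs", "melons"]

# Every unordered pair of different fruits, each pair listed exactly once.
pairs = list(combinations(fruits, 2))

print(len(pairs), "pairs listed;", number_of_pairs(len(fruits)), "predicted by n(n - 1)/2")
print("first few pairs:", pairs[:3])
```

Because the count grows roughly with the square of the number of stimuli, doubling the list to twenty fruits would raise the number of judgments from 45 to 190.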

The method of pair comparisons is usually considered the most exact 
psychophysical tool for many laboratory studies. It can be used to provide 
much the same information that can be obtained by the other methods; 
and, because of the many judgments which the subject must make in pair 
comparisons, the results are usually more reliable. Also, the method gives 
an indication of the definiteness with which the subject prefers some of 
the stimuli more than others. If an individual were to make pair compari- 
sons among objects which he preferred equally or which he judged to be 
nearly equal in terms of an attribute, each object would be chosen about 
as many times as those with which it was compared. Such information 
could not be obtained in the method of rank-order, where the subject is 
forced to give a different rank to each stimulus. Due to the laboriousness 
of the pair comparisons procedure, its use is restricted to situations in 
which it is necessary to obtain rather exact information about a set of 
preferences or judgments. 

The Place of Psychophysics in Psychology. We have had only a brief 
look at Fechner’s contributions and some of the psychophysical methods 
which grew from his work. There are numerous other special methods, 
variants of the basic procedures, which are useful in studying particular 
problems. It is important to note that, as in nearly all the early work, in 
most current psychophysical studies, investigators are not primarily con- 
cerned with learning about individual differences. Instead, the purpose 
is to find laws about human judgment and preference that apply to 
everybody. In so far as large differences among people appear in psycho- 
physical studies, they only serve to complicate the general lawfulness that 
is being sought. 

In addition to the rich tradition of psychophysical research which still 
goes on in studies of human judgment, the psychophysical methods are 
used widely in physiological psychology, studies of perception, and the 
psychology of learning. Our particular concern is with the impact of 
psychophysics on the study of individual differences. Not only are the 
psychophysical methods used now in studies of individual differences, 
but the firm logical foundation of human measurement has been brought 
over into the development of psychological tests. We will not go into 
detail about psychophysics, per se, in the pages ahead (see 5 for a detailed 
description of the psychophysical methods). An introduction to the sub- 
ject was given here to show the wide range of problems that require 
psychological measurement methods and to place in perspective the atten- 
tion which will be given to the measurement of individual differences in 
subsequent chapters. 


THE GROWTH OF STATISTICS 


The construction and use of psychological tests are intimately con- 
nected with statistical concepts and statistical techniques. Therefore, it 
would be helpful to discuss some of the foundations for statistical methods 
before they are applied to testing problems in subsequent chapters. 

The impetus for the development of statistical methods came initially 
not from the budding scientific disciplines but from the needs of gamblers. 
The gamblers needed some principles for the setting of betting odds in 
roulette, dice, and card games. Men such as Bernoulli, De Moivre, 
Laplace, and Gauss laid the foundation for our modern systems of mathe- 
matical statistics during the time between the middle of the seventeenth 
and the middle of the eighteenth century. 

The central notion in the discipline of statistics is that of probability. 
Previous to the development of statistics, scientific laws had been ex- 
pressed in a way that allowed no room for errors in prediction. The law 
of falling bodies, the predictions of celestial movements, and the laws of 
heat exchange were intended to predict events in nature exactly. How- 
ever, in actual experiments the results seldom come out exactly as pre- 
dicted. Such inconsistencies were explained in terms of the inability to 
achieve purified laboratory conditions, which require complete vacuums, 
noncompressible fluids, completely constant temperature, and other ideal 
circumstances. Later it came to be realized that such ideal conditions 
could never be obtained, and consequently, that the results of experi- 
ments could be only approximately predicted. In some fields of research 
we have seen a succession of laws, each of which came closer to explain- 
ing a scientific problem. Scientists have come more and more to look on 
all laws as approximations, approximations that fit the real world so 
closely as to be “true” for all practical purposes. 

A consideration of probability is more necessary in dealing 
with some phenomena than it is when dealing with others. Even relatively
crude scientific principles will serve us in everyday life, in problems 
such as constructing a canoe to hold a certain amount of weight, setting 
up a system of gears to lift tree stumps, and determining the amount 
of water that will be needed to fill a swimming pool. But we can see why 
gamblers were in great need of the statistics of probability. The individual 
flip of a coin or throw of dice will obey not even an approximate lawfulness.
With events like these, it can only be said that one outcome is more 
likely, or more probable, than another. Because of the randomness of the 
individual event, statistical laws are true only on the average. For exam- 
ple, if we flipped a coin 1,000 times, we would expect to obtain approxi- 
mately 500 “heads” (although a “biased” coin might fool us here). 

In many scientific situations, predictions can be made only about aver- 
ages or total effects rather than about individual events. For example, the 
laws of gases are true only on the average. It can be predicted how a 
gaseous body will behave under changes in temperature, but it cannot be 
predicted how the individual molecule will behave. As the temperature 
is increased, the molecular movement increases, on the average. However, 
some of the molecules might slow up for a time because of collisions with 
others, and some of the molecules might accelerate beyond the average. 
But when the behavior of nearly countless molecules is averaged out, pre- 
dictions become very accurate for all practical purposes. 

The social sciences are particularly dependent on concepts of prob- 
ability. For example, a principle about the behavior of “mobs” might hold 
only for the action of the group as a whole. It might be predicted that the 
amount of shouting for the group as a whole would increase. However, 
assuming some credence for this hypothetical principle, it is quite likely 
that some of the group members would actually become quieter as the 
general noise level increased. 

In addition to predicting the average performance for a number of 
events or for a number of persons, it is equally important to learn about 
the amount of error which such predictions entail. Thinking back to the 
problems confronting the seventeenth-century gamblers, the errors in- 
volved in the prediction of coin tosses can serve to illustrate the problem. 
If an individual tossed 10 coins on a table, what are the odds that eight 
of them would be heads? If the 10 coins were tossed 1,000 times, how 
many times would all ten of them be tails? The expected occurrences for 
10 coins tossed 1,024 times are shown in Table 2-1. We would expect to 
get 10 heads, and consequently no tails, in 1 toss out of 1,024 tosses. As 
would be expected, the most frequently occurring pattern is 5 heads and 
5 tails. 


TABLE 2-1. Expected Occurrences of “Heads” and “Tails” for 10 Coins 
Tossed 1,024 Times 


Heads        0    1    2    3    4    5    6    7    8    9   10 
Tails       10    9    8    7    6    5    4    3    2    1    0 
Frequency    1   10   45  120  210  252  210  120   45   10    1 


The distribution of heads and tails is pictured in graph form in Figure 2-3.
If the number of coins in each toss is increased from ten to one 
hundred, the curve will begin to smooth out to resemble that in Figure 2-4. 


Ficune 2-3. Graph of expected occurrences of “heads” and “tails” for 10 coins tossed 
1,024 times. 
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Ficune 2-4. Smoothed curve showing expectancies of “heads” for a large number of 
coins tossed many times. 


Frequency 


No. of heads 


The bell-shaped curve in Figure 2-4 is encountered very often in practical
work. It is referred to as the normal distribution, or it is variously
called the normal curve and the normal curve of error. Much of the early
history of statistics was concerned with the properties of the normal
curve.

[...] Normal Distribution. During the last hundred years many
attempts have been made to apply statistical principles to the
behavior of people. In the coming chapters we will see the key place that
[...] in the development and use of tests. During the
middle of the nineteenth century, Adolph Quetelet, a Belgian statistician,
undertook to gather a considerable amount of information about
European populations. He collected information on the number of children
in families, the number of children born in different years, and the
physical characteristics of people. He found that many of these characteristics
distribute themselves in a shape much like the normal distribution.
For example, he found that the chest measurements of soldiers were
distributed like the normal distribution.

Because Quetelet knew that the normal distribution demonstrates the
error in many types of predictions, he came to the conclusion that a normal
distribution of human traits shows “nature's error” in the composition
of human beings. It was his idea that “nature” aims at producing the
average man, and that the extremes in either direction are “accidents of
nature.” We remember Quetelet not so much for this principle as for the
influence he had on men to follow. He was one of the first persons to
make systematic studies of individual differences, and he interested others
in the application of statistical methods to the study of human behavior.


THE EFFECT OF THE THEORY OF EVOLUTION ON PSYCHOLOGY 


Although the theory of evolution had its most direct influence on the 
science of biology, it also had an important impact on psychological meas- 
urement and psychology as a whole. Before the nineteenth century, the 
predominant view was that man is a static being, having since the day 
of creation a uniform and unchanging set of physical and mental attri- 
butes. Although Quetelet measured individual differences, he thought of 
these as “nature’s mistakes” in producing the average man. It had always 
been noted that men differ from one another, but there had been no sys- 
tematic attempt to study the part which individual differences play in 
everyday life. 

Darwin. There were dissenters to the static theory of man and his sur- 
roundings, and there were even some, men such as Lyell and Lamarck, 
who theorized that life changes from generation to generation. It re- 
mained for Charles Darwin to amass the evidence to support the theory 
of evolution. He collected many examples from the animal kingdom show- 
ing the changing forms of life that develop in special environments. He 
saw that the mechanism of selection, or “survival of the fittest” as he called 
it, allows some animals to exist and forces others to die off. Therefore, the 
animals whose particular features make them more suited to an environ- 
ment survive and pass on their genetic qualities to the next generation. 

Darwin’s major evidence for evolution was given in The Origin of 
Species, published in 1859. It brought on several decades of argument
among scientists as to the validity of the theory and the implications of 
the theory for scientific work. As the theory came to be accepted, it 
forced new viewpoints on scientific disciplines. It required biologists and
botanists to think of animals and plants as being in a state of change,
gradually progressing from one form to another to meet the changing 
environments. The theory also gave an impetus to the study of individual 
differences in psychology. It was reasoned that, if individual differences 
in plants and animals make for differential ability to adapt and survive, 
individual differences in humans would also have a functional significance.
Further, if plants and animals inherit ancestral characteristics,
some of the individual characteristics in humans might be accounted for 
on the same basis. 


EARLY STUDIES OF INDIVIDUAL DIFFERENCES 


Galton. The several lines of thought that we have traced—psycho- 
physics, statistics, and the theory of evolution—converged in the psy- 
chological work of Sir Francis Galton. He was a member of a vanishing 
“breed,” the gentleman scholar who, without formal university connec- 
tions, dabbled effectively in a wide range of subjects. His interests ranged 
from criminology, psychology, biology, and anthropology, to the inven- 
tion of numerous scientific devices. He invented the “supersonic” whistle 
and delighted himself by calling the dogs of London away from their 
incredulous masters. He invented composite photography, a procedure 
for combining a number of photographs into an average human appear- 
ance. He applied the method to the photographs of criminals, trying to 
obtain a composite of typical criminal features. He once practiced para- 
noia, going about imagining that everyone was talking about him. After 
several days he could imagine that even horses were plotting against him. 
In order to understand primitive religions, he built a religion of his own. 
He placed a comic picture of Punch on the wall and paid it all manner of 
homage and tribute in order to grasp the meaning of the idol in primitive 
worship. In addition to his many other activities, Galton also applied his 
genius to the study of psychological traits. 

Galton was Darwin’s immediate follower in the field of psychology. He
developed an interest in the hereditability of individual characteristics
and made studies of prominent men in England to support his views. In
1869 he reported his findings in Hereditary Genius, in which he tried to
show that “greatness” runs in families. He came to believe that most
important personal characteristics are inherited, not only the prominent
physical characteristics, but also abilities and personality characteristics.
He believed that criminality and psychological disorders are passed along
from father to son by direct inheritance.

Galton founded the study of eugenics, the avowed purpose of which is
the betterment of the human race through the control of mating. Although
Galton never worked out the social problems which his program would
involve, he was convinced that a world of superior humans could be
developed by encouraging gifted couples to marry and by discouraging
or preventing less gifted individuals from having children.

Galton realized that, before his program of eugenics could be under- 
taken, it would be necessary to learn more about the inheritance of indi- 
vidual characteristics. If it could be shown that a particular attribute, 
literary ability for example, is passed from father to son, then it should be 
possible to breed a generation of literary geniuses. However, if the attri- 
bute proved not to be inherited, there would be nothing that a program of 
eugenics could do to improve literary talent. Galton needed first to meas- 
ure human characteristics before proceeding to study their hereditability, 
and it is because of this need to measure individual differences that Gal- 
ton’s contributions are of such importance to us here. 

Galton coined the term “mental test” and set about to measure many 
different human attributes. He recognized the need for standardization in 
testing, that all subjects should be presented the same problems under 
uniform conditions. Galton’s tests bore little resemblance to the ones 
which are most widely used today. He and his immediate followers in 
England made much of the philosopher Locke’s dictum that all knowledge 
comes through the senses. They reasoned from this that the person with 
the most acute “senses” would be the most gifted and most knowledgeable.
Most of Galton’s tests were measures of simple sensory discrimination:
the ability to recognize tones, the acuteness of vision, the ability
to differentiate colors, and many other sensory functions. Galton also 
invented many instruments to use in his research, some of which are used 
in modified form today. 

Galton began the first large-scale testing program at his Anthropometric 
Laboratory in the South Kensington Museum in 1884. Each visitor was
charged threepence for having his measurements taken on a variety of 
physical and sensory tests, including height, weight, breathing power, 
strength of pull, hearing, seeing, and the color sense. In order to analyze 
the data obtained, Galton made use of statistical methods. With these, 
he determined averages and measures of variation. His particular need 
was for a measure of association, or correlation, to detect the amount of 
resemblance between the individual characteristics of fathers and their 
sons. He made the first attempts to derive the statistics of correlation, the 
procedures which are now widely used in the study of individual 
differences. 

Pearson. Galton supported a younger colleague, Karl Pearson, in the
development of statistical methods for the study of individual differences.
Pearson was the genius in mathematical statistics that Galton was in
studies of individual differences. He derived the correlation coefficient,
partial correlation, multiple correlation, and factor analysis, and laid the
foundation for most of the multivariate statistics which are in use in
psychology today (some of which will be seen in the chapters ahead).

Following the work of Galton, the last two decades of the nineteenth 
century saw a considerable expansion of studies of individual differences, 
in England and America. Prominent in this work was James McKeen 
Cattell, who studied with Galton and then undertook investigations in 
America. The tests continued to be largely of a sensory and motor type, 
measuring sensory speed and reaction time. Toward the end of the cen- 
tury, it came to be realized that these tests were not measuring intelli- 
gence. Scores on the tests bore almost no relationship to measures of 
intellectual achievement, such as school grades, refuting the hypothesis 
that sensory ability is closely related to intelligence. 


THE FIRST PRACTICAL MENTAL TESTS 


A brief summary will be given of the major developments in psycho- 
logical testing during this century. Each type of test which is mentioned 
will be discussed in detail in later chapters. 

At the turn of the twentieth century, some notable events in the history 
of psychological measurement occurred in France. It was there that the 
humanist tradition and the interest in medicine and social welfare were 
centered. After the French Revolution, terms such as “equality” and “the 
rights of man” showed the new interest in helping the downtrodden, the 
sick, and also the psychologically maladjusted. Medicine developed rap- 
idly during the first part of the nineteenth century, partly because of the 
Napoleonic Wars and the need to treat wounded soldiers. Pinel released 
the insane from their dungeons and insisted that they were sick, not pos- 
sessed by demons. Charcot, Janet, and Ribot founded the field of psychi- 
atry and developed the first plausible theories of psychopathology. Freud 
drew his early knowledge from these men and went on to be the founder 
of psychoanalysis. 

Binet. During the last quarter of the nineteenth century, the distin- 
guished line of French investigators was joined by Alfred Binet. His early 
work was on the use of hypnotism in treating mental disorders; then near 
the end of the century, he turned to the problem of measuring intelligence.
He was commissioned by the Minister of Public Instruction to construct
tests for the measurement of intelligence in school children. The
school authorities were alarmed by the number of failures and needed some
means of distinguishing the lazy or maladjusted from the children who
lacked the fundamental capacity to learn.

Binet, in collaboration with Simon, completed his first test in 1905. It
was not principally concerned with sensory and motor functions, as had
been the case with the tests of Galton and his immediate followers.
Instead, it concerned the child’s ability to understand and
reason with the objects in his cultural environment. Consequently, the test
items were such as the naming of objects, the completion of sentences,
and the comprehension of questions.

The first Binet-Simon test had a range of questions from very simple
ones, which even the dullest child might answer correctly, to questions
which would indicate a superior level of ability. Practical work with the
test indicated that some of the items were more difficult than had been
anticipated, and some of the items proved to be insensitive to differences
in ability. A revision of the test was made in 1908, in which the items were
arranged in terms of age levels, with a group of items being designated as
representing average intelligence for each particular age. The highest age
level at which a child could perform adequately was called his mental
age. Later, Stern suggested that the mental age be divided by the
child’s chronological age, which is essentially the IQ as it has come to be
known.
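
Stern's ratio can be written out as a short formula. The symbols MA (mental age) and CA (chronological age) are notation brought in here for illustration, and the multiplication by 100 is the convention later adopted with the Stanford-Binet, not something stated in the passage above. The ages in the example are invented.

```latex
\[
\mathrm{IQ} = 100 \times \frac{\mathrm{MA}}{\mathrm{CA}},
\qquad
\text{for example}\quad 100 \times \frac{10}{8} = 125 .
\]
```

Under this convention a child whose mental age equals his chronological age has an IQ of 100.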

The Binet-Simon tests aroused considerable interest in other countries, 
in England and especially in America. Several translations and revisions of 
the tests were made in America, including forms by Goddard (4) and 
Kuhlmann (6). The tests became very popular in studying school chil- 
dren, particularly in the understanding of psychological maladjustment 
and juvenile delinquency. The most prominent revision of the Binet- 
Simon test was completed by Terman (9) in 1916, the resulting form 
being referred to as the Stanford-Binet test. For the first time a consider- 
able amount of research was done to select items which would be appro- 
priate for each age level and which would differentiate age levels. Also, 
the large number of children who were tested provided norms for the
interpretation of test scores. 

Terman and Merrill presented a revision of the Stanford-Binet in 1937 
(10). A truly heroic effort was made to construct a standardized measure 
of general ability. Several thousand children were tested in all parts of 
the United States to establish norms. This test has been used more fre- 
quently than any other comparable instrument and is still in wide use 
today. (The Stanford-Binet will be discussed in detail in Chapter 10.) 


THE FIRST GROUP TESTS OF ABILITY 


The Binet forms and their revisions are all individual tests, requiring 
an expert examiner to test one child at a time. During the First World 
War there was a need for classifying large numbers of recruits, to weed 
out men whose level of ability was so low that they would be a handicap 
to the service, and also to select more intelligent individuals for positions 
of responsibility. A special committee of the American Psychological Association,
with R. M. Yerkes as chairman, began an investigation of testing
procedures for the armed services.

Arthur Otis had been working on a group test of “intelligence,” which 
he turned over to the committee. With some revisions, it was put to use. 
This test is referred to as the Army Alpha Test (7). It was a short form 
containing questions on arithmetic, reasoning, and general information. 
The test proved very successful, came into wide use in the armed forces, 
and is still used to some extent today. A companion test, the Army Beta 
(8), was constructed for use with individuals whose knowledge of Eng- 
lish is scanty. The test instructions are given in pantomime, and the test 
items do not require the subject to use or understand English. The test 
requires the subject to solve geometrical puzzles and to analyze pictures. 
Although the Beta test was not so successful as the Alpha test, it set the 
stage for the nonlanguage tests which are in use today. 


STANDARDIZED CLASSROOM EXAMINATIONS 


The Essay Exam. The standardized classroom examination as we know 
it today came into wide use only during the last fifty years. Before that 
time it was customary for the instructor to give grades on the basis of his 
accumulated impressions of the student or on the basis of oral examina- 
tions. Grading in this way can be unfair to the student who is not popular 
with the teacher or who does not have the verbal facility to impress 
others. The first standardized examinations were of the “essay” type, in 
which the student writes about a number of problems or topics. Although 
grading is still somewhat subjective, the “essay” examination is standard- 
ized to the extent that all students receive the same questions, providing 
a uniform basis of comparison. 

The Multiple Choice Examination. Progress in classroom examinations 
has moved in two directions, in the refinement of essay examinations and 
in the development of multiple choice tests. The advent of the multiple 
choice, or “objective” test, as it is sometimes called, ruled out the unreliability
of instructors’ impressions; and although there are abuses in the
use of multiple choice tests, examinations of this kind are at least uniformly
fair to all students.


Standardized Achievement Tests. A more recent tre


THE DEVELOPMENT OF PERSONALITY TESTS


In line with the sequence in which tests were developed, we have 
talked so far about tests of ability, which concern how much an individual 
knows, how rapidly he can work, or how well he can solve problems. 
Tests of this kind are sometimes referred to as “cognitive” tests and have 
been contrasted with the variety of attributes which are “noncognitive” or 
not abilities in the strict sense. Cronbach (3) has distinguished these two 
kinds of attributes as being tests of “maximum performance” as opposed 
to tests of “habitual performance.” You can see how these functions would 
differ if we were considering a test for “friendliness.” It would not be a 
case of how friendly an individual could be if he tried, because most of 
us know how to be friendly. The question would be that of how friendly 
an individual habitually is in his dealings with others. Another synonym
for noncognitive tests or tests of habitual performance is personality tests.
Although the term has come to be used far too broadly to cover interests,
attitudes, mental health, and many other matters, it is too firmly in use 
to be avoided. 

Tests of ability were developed first and more firmly not because of a 
lack of interest in personality but because of the difficulty in measuring 
the noncognitive functions. Galton was as interested in the inheritance 
of personality as he was in the inheritance of abilities. However, the 
nearest he came to measuring personality characteristics was in a ques- 
tionnaire study of “imagery,” concerning the extent to which individuals 
can picture objects in memory. Binet was interested in tests of personality 
before constructing the first “intelligence” test, and among other devices, 
he designed a questionnaire test of “morality” for children. 

Self-report Inventories. Psychologists were not only busy during the 
First World War developing “intelligence” tests, but they also tried to 
meet the need for tests of neuroticism and emotional instability. The num- 
ber of recruits was too large to be seen individually in psychiatric inter- 
views. R. S. Woodworth constructed a questionnaire which asked each 
recruit essentially the kinds of questions that would be used in an interview.
The questions concerned the individual’s adjustment in the home,
school, and among friends. It was used primarily to sift out those men who 
needed further clinical study, but not, as similar inventories were often 
used afterward, to provide “measures” of personality. All the difficulties 
involved in the use of this instrument still continue to hinder the development
of personality tests: the individual’s inability or unwillingness to
describe himself accurately and the difficulties in finding a valid set of 
items for use with even the most cooperative subjects. 

Projective Tests. During the First World War, a Swiss psychiatrist
named Hermann Rorschach was working on a novel line in the measurement
of personality. The Rorschach Test (see 1), as it has come to be
known, consists of 10 ink blots which the subject is required to describe 
and interpret. The test grew out of Freudian and other “depth” psycholo- 
gies with their emphasis on unconscious motivation and the importance 
of symbolism. The purpose of the test is to get beyond what the subject 
knows about himself, and is willing to relate, to the “deeper” traits and 
urges which determine his overt behavior. Since Rorschach’s pioneer
research, numerous other projective techniques have been developed. In 
spite of the great difficulties in standardizing tests of this kind, the 
Rorschach and its followers have become the mainstay of the clinical 
psychologist. 


THE PERIOD BETWEEN THE TWO WORLD WARS 


As wars have so often been the gloomy milestones in historical devel- 
opments, the direction of the measurement movement in psychology has 
changed with the two major conflicts of this century. We saw how the 
practical problems of the First World War encouraged the development 
of the first group tests of “intelligence” and personality. The products 
were inexpensive, convenient tests which could be applied to large 
numbers of people. Consequently, group tests came into vogue during 
the 1920s and 1930s and, unfortunately, were often applied uncritically 
to many problems. There was too little concern for testing the test, too 
little concern for determining the reliability and validity of the measures. 
Teachers often took the results of “intelligence” tests at face value and 
made major decisions about students on the basis of questionable measures.
The general public became oversold on tests, and the IQ came to be
regarded as an infallible and fixed mark on the individual. Numerous tests 
of personality purporting to measure “introversion,” “marriage adjust- 
ment,” “happiness,” and many other complex attributes were constructed 
and sold with little research found

Instead of having to construct only several group tests as had been the
case in the previous conflict, psychologists were called on to construct 
numerous batteries of tests, some of them complicated beyond the ex- 
tremes of what the early testers might have imagined. Psychologists were 
brought into the problem wholesale and spent several years in implement- 
ing the largest testing program that had ever been undertaken. In the 
process, the logic and methods of psychological measurement were considerably
extended, and many researches were undertaken which would
not have been possible without government support. The psychology of 
testing was pushed years ahead by the work which was undertaken in 
conjunction with the armed forces. Much of the work was methodological, 
concerning the best way to measure particular functions, in contrast to 
the pell-mell effort of earlier decades to compose tests without proper 
research foundation. 

Now we find that the people who construct tests and the people who 
use them are more aware of the limitations of particular measures, more 
inquisitive about the research findings on an instrument, and more mind- 
ful of alternative approaches. A healthy skepticism has grown, and with 
it have come promising new lines of investigation in the measurement of 
individual differences. Testing has become a big business, with numerous 
firms devoting themselves to the development and sale of tests. Hundreds 
of psychologists and professional educators spend most of their time as 
measurement specialists. Psychological publications have hundreds of 
articles each year on specialized problems in measurement. However, 
there are still abuses in testing—tests being sold without proper research 
and a naïve acceptance by some persons of any scrap of paper which 
purports to be a test. In spite of the remaining abuses and in spite of the 
unsolved problems in many kinds of tests, psychological tests can now 
point to an increased efficiency and marked economy which they have 
brought to schools, the Armed Forces, psychological clinics, and more 
recently, to industry.


CHAPTER 3


The Use of Scores and Norms 


Test results are usually reported in some numerical form, for example, 
the total number of questions answered correctly on a true-false test. The 
numbers obtained in this way, before they are converted to any other 
form, are referred to as raw scores. Raw scores are seldom directly mean- 
ingful without some qualification as to how well other persons do or as to 
the established standards of performance. If Johnny tells his mother that 
he has made 22 in arithmetic and 48 in spelling, the mother might say, 
“That’s good work in spelling, but we must do something about your arith- 
metic.” But the 22 in arithmetic might have been the highest score; and if 
the spelling test consisted of 100 words, and the teacher expected the stu- 
dents to know the majority of them, a score of 48 would be an indication 
of poor performance. 


MEASURES OF CENTRAL TENDENCY 


The first thing that we need to learn about Johnny’s performance is how 
well the students as a group performed on the two tests. This will tell us 
whether Johnny is above or below average. To simplify the problem, let 
us assume that there are only 11 students in the class and that they made 
scores on the two tests as follows: 


            Arithmetic   Spelling
Johnny          22          48
Fred            12          52
Mary            14          49
Bill            12          51
Jane            14          55
Susan           14          52
Michael         17          50
Sharon          19          62
Harry           11          56
Patricia        15          52
Eric            20          75


By looking at the list of scores, it can be seen that Johnny did very well 
in arithmetic in comparison to the other students, but relatively poorly in
spelling. However, Johnny’s mother would probably not have an opportunity
to look at the list of scores; and it would be a clumsy way to make
comparisons if it were necessary to pass around the list to everyone who 
was interested in the test results. 

Mode. Some index is needed to let the mother know at once whether 
Johnny did as well as the other children. One such index, called the mode, 
is obtained quite simply by finding the score made most frequently by 
the students. In the arithmetic test, more students made a score of 14 
(Mary, Jane, and Susan) than any other score. The mode is 52 on the 
spelling test. Although the mode is not as useful as some other measures, 
it is sometimes used in testing as a measure of central tendency. 
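
As a small illustration of the procedure just described, the mode can be found by counting how often each score occurs. This is a sketch added for clarity, not part of the original text. Eric's spelling score is taken as 75, the value implied by the spelling mean of 54.73 and the range of 27 reported later in the chapter; the helper name `mode` is chosen for the example.

```python
from collections import Counter

# Scores in class-list order: Johnny, Fred, Mary, Bill, Jane, Susan,
# Michael, Sharon, Harry, Patricia, Eric.
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]  # Eric assumed to be 75


def mode(scores):
    """Return the score made most frequently.

    If several scores tie for most frequent, the one met first is returned.
    """
    return Counter(scores).most_common(1)[0][0]


print("arithmetic mode:", mode(arithmetic))
print("spelling mode:", mode(spelling))
```

With these scores the function gives 14 for arithmetic and 52 for spelling, in agreement with the text.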

Median. Another measure of central tendency is obtained by seeking 
the score which has an equal number of students above and below that 
point. In the arithmetic test, the median is 14. Of the 11 students, five 
made scores higher than 14 and six made scores of 14 or less. The median 
is 52 on the spelling test, and there are three students whose scores fall 
exactly at the median. If there is an even number of scores, for example if
we show the scores for 10 students instead of 11, the median falls between two scores.
If Fred had not taken the arithmetic test, the median would have fallen 
between 14 and 15. The usual practice is to split the difference and say 
that the median is 14.5. If Johnny’s score of 48 were not considered on the 
spelling test, the median would again lie between two scores, both of 
which are 52. The median would then still be 52. 
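
The median cases discussed above can be reproduced with the standard library. The sketch below is an added illustration; the helper `median_without` is a made-up name, and Eric's spelling score is again assumed to be 75, the value implied by the reported mean and range. For an even number of scores, `statistics.median` averages the two middle values, which is the "split the difference" practice described in the text.

```python
from statistics import median

arithmetic = {
    "Johnny": 22, "Fred": 12, "Mary": 14, "Bill": 12, "Jane": 14, "Susan": 14,
    "Michael": 17, "Sharon": 19, "Harry": 11, "Patricia": 15, "Eric": 20,
}
spelling = {
    "Johnny": 48, "Fred": 52, "Mary": 49, "Bill": 51, "Jane": 55, "Susan": 52,
    "Michael": 50, "Sharon": 62, "Harry": 56, "Patricia": 52,
    "Eric": 75,  # assumed value, see the note above
}


def median_without(scores, left_out=None):
    """Median of the class scores, optionally leaving one student out."""
    kept = [score for name, score in scores.items() if name != left_out]
    return median(kept)


print(median_without(arithmetic))            # all 11 students
print(median_without(spelling))              # all 11 students
print(median_without(arithmetic, "Fred"))    # 10 students, middle pair 14 and 15
print(median_without(spelling, "Johnny"))    # 10 students, middle pair 52 and 52
```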

The median can be thought of as the score made by the “average person.”
This is different, as will be shown shortly, from the average score.
One advantage of the median is that it is easy to understand. The teacher
would find it relatively easy to explain this index to Johnny’s mother, and
it would prove useful in discussing Johnny's class standing.

The reason that the mode and the median are not used more is that they
cannot serve as a basis for the derivation of other statistics which are
needed. If we start off with certain statistical measures instead of others,
it greatly simplifies the mathematical work to follow; and the mode and
median are poor starting points.


Mean. A measure of central tendency can be obtained which is both
easy to understand and usable in numerous other mathematical
developments. The mean is obtained by adding all of the scores on a test
and dividing the sum by the number of persons tested. Introducing some
of the symbolism which will prove useful, let X stand for the raw scores
made on a test, N for the number of subjects, and M for the mean. Then
the formula for the mean is

$$M = \frac{\sum X}{N}$$

The summation sign ($\Sigma$) stands for the process of adding scores, the
scores of Johnny, Fred, Mary, and so on, for the 11 students.

The symbolism can be made more specific by the use of subscripts to 
show the test on which the mean is being obtained. Letting the arithmetic 
test be referred to as test 1, the formula for the arithmetic mean can be
more completely specified as

$$M_1 = \frac{\sum X_1}{N}$$

Different subscripts can be used to designate other tests, such as using the 
subscript 2 in referring to statistical operations on the spelling-test scores, 
and the subscripts 3, 4, 5, and so on, for other tests. 

Applying the formula to the students’ grades, a mean of 15.45 is found 
for the arithmetic test and a mean of 54.73 for the spelling test. A com- 
parison can be made of the three measures of central tendency on the 
two tests as follows: 


Arithmetic Spelling 
Mode 14 52 
Median 14 52 
Mean 15.45 54.73 
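
The means in the comparison above follow directly from the formula, the sum of the scores divided by the number of students. The sketch below is an added check rather than the author's own computation. It assumes Eric's spelling score is 75, the only value consistent with the reported spelling mean of 54.73.

```python
# Scores in class-list order: Johnny, Fred, Mary, Bill, Jane, Susan,
# Michael, Sharon, Harry, Patricia, Eric.
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]  # Eric assumed to be 75


def mean(scores):
    """M = (sum of X) / N."""
    return sum(scores) / len(scores)


m1 = mean(arithmetic)  # test 1, arithmetic
m2 = mean(spelling)    # test 2, spelling

# How many students fall below the spelling mean?
below_m2 = sum(1 for x in spelling if x < m2)

print(f"M1 = {m1:.2f}, M2 = {m2:.2f}, below spelling mean: {below_m2}")
```

One very high score is enough to pull the spelling mean above the scores of most of the class, which is why the mean and the median part company on that test.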


The three measures give similar results on both tests. Note that the mean 
on the spelling test differs by 2.73 from the median and the mode. Look- 
ing back at the scores, it can be seen that seven of the eleven people made 
less than the mean on spelling. This is due to Eric’s performance in spelling,
a score so divergent from the others as to affect the mean unduly. It
is in situations like this that the mean can cause confusion in the interpre- 
tation of test scores. 

When either one person or a small group of persons makes scores which 
are markedly higher or lower than the rest of the group, the median is 
often more useful than the mean for interpreting test results. Fortunately, 
such extreme cases as the example above are not often found in practice. 

Deviation Scores. As a first step in transforming raw scores to a more 
useful form, each score can be expressed in terms of its distance from the 
mean. The transformed scores are referred to as deviation scores and are 
symbolized by a lower-case x: 


\[ x = X - M \]

The formula states that each person’s deviation score is obtained by
subtracting the mean from his raw score. The class grades can be transformed
to deviation scores as follows:

Arithmetic    Spelling
    x₁            x₂
   6.55         -6.73
  -3.45         -2.73
  -1.45         -5.73
  -3.45         -3.73
  -1.45          0.27
  -1.45         -2.73
   1.55         -4.73
   3.55          7.27
  -4.45          1.27
  -0.45         -2.73
   4.55         20.27

Deviation scores tell whether an individual is above or below average. 


Because Johnny’s deviation score in arithmetic is positive, we know that 
he is above the mean; because his deviation score is negative on the spell- 
ing test, we know that he is below the mean. 
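
The measures of central tendency and the deviation scores can be checked with a short program. The following is a minimal sketch in Python; the raw scores are not quoted from a table but are recovered by adding each mean back to the deviation scores listed above, which is an assumption, and the function name `deviation_scores` is illustrative only.

```python
from statistics import mean, median, mode

# Raw scores recovered from the deviation-score table (deviation + mean);
# this recovery is an assumption. Johnny is listed first and Eric last.
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]


def deviation_scores(scores):
    """Return x = X - M for each raw score X."""
    m = mean(scores)
    return [score - m for score in scores]


x1 = deviation_scores(arithmetic)
x2 = deviation_scores(spelling)

# Mode, median, and mean for each test.
for label, scores in (("Arithmetic", arithmetic), ("Spelling", spelling)):
    print(label, mode(scores), median(scores), round(mean(scores), 2))

# Johnny's deviation scores: positive in arithmetic, negative in spelling.
print(round(x1[0], 2), round(x2[0], 2))
```

Because every deviation is measured from the mean, the deviation scores for each test sum to zero apart from rounding.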


MEASURES OF DISPERSION 


Before we can interpret a particular deviation score, we must learn 
how widely the scores are scattered above and below the mean. A devia- 
tion score of 2.00 would represent superior performance if all of the 
scores are closely packed about the mean. But if there are deviation 
scores that go as high as +100 and as low as —100, a deviation score of 
2.00 would indicate near average performance. Consequently, we need an
index of the spread, or scatter, of scores about the mean in order to
interpret particular deviations.

The Range. There are various indices of how widely a group is scattered,
or of the dispersion, as it will be called. One very simple index, the
range, is obtained by subtracting the lowest score from the highest score.
The highest score on the arithmetic test is 22 and the lowest is 11. This
gives a range of 11. The range on the spelling test is 27, showing that the
dispersion of scores is greater than that in arithmetic. The range is a
quickly obtained and often used index of dispersion. However, it lacks
some of the properties that are needed for an acceptable measure. It is
dependent on only two scores, the highest and the lowest. If Eric had not
taken the spelling test, the range would be only 14 instead of 27. Also, the
range lacks the mathematical properties which can lead to the development
of other statistics (a point to which we will appeal quite often in
choosing statistical measures).

The Average Deviation. An index of dispersion which is dependent on
all of the scores instead of just two of them and which indicates the
position of an individual in a group is the average deviation. As the name
implies, it is obtained by finding how much the scores deviate on the
average from the mean, as follows:


Arithmetic    Spelling
   |x₁|          |x₂|
   6.55          6.73
   3.45          2.73
   1.45          5.73
   3.45          3.73
   1.45          0.27
   1.45          2.73
   1.55          4.73
   3.55          7.27
   4.45          1.27
   0.45          2.73
   4.55         20.27

Arithmetic:  Σ|x₁| = 32.35,  AD = 32.35/11 = 2.94
Spelling:    Σ|x₂| = 58.19,  AD = 58.19/11 = 5.29


The symbol |x| indicates that we are dealing with absolute deviations,
paying no attention to the signs.

One method of transforming deviation scores to a more useful form is 
to divide each of them by the average deviation (AD). This conversion 
will give Johnny a score of 2.23 in arithmetic and —1.27 in spelling. When 
the distribution of scores is normal, scores transformed in this manner will 
lead to a precise indication of where an individual stands in a group. 
However, developing this measure further will not be worth the effort. 
There are more desirable measures of dispersion which can be used; the 
range and AD have been discussed to provide a background for the meas- 
ure now to be developed. 
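
The range and the average deviation are easily computed by machine. The sketch below is illustrative only: the raw scores are an assumption (recovered from the deviation scores and the two means), and the function names are not taken from the text.

```python
# Raw scores recovered from the deviation scores and means (an assumption);
# Johnny is the first entry and Eric the last.
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]


def score_range(scores):
    """Highest score minus lowest score."""
    return max(scores) - min(scores)


def average_deviation(scores):
    """Mean of the absolute deviations |x| = |X - M|."""
    m = sum(scores) / len(scores)
    return sum(abs(score - m) for score in scores) / len(scores)


ad1 = average_deviation(arithmetic)
ad2 = average_deviation(spelling)
print(score_range(arithmetic), score_range(spelling))
print(round(ad1, 2), round(ad2, 2))

# Dropping Eric (the last entry) shows how much the range depends on one score.
print(score_range(spelling[:-1]))
```

Since the program carries unrounded values throughout, a score divided by the AD may differ in the last decimal place from a hand computation that uses rounded figures.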

The Standard Deviation. The AD has a serious fault: it is based on 
absolute scores. It is very difficult to work mathematically with absolute 
scores; and consequently, if the AD is used in some of the early statistical 
work, it severely limits the development of other measures. However, we 
will also find ourselves blocked if we seek a measure of dispersion by 
working with the deviation scores. The sum of these is always zero in any 
set of scores. Therefore, equations based on the sum of deviation scores 
“fall apart” and leave nothing with which the mathematician can work. 

An alternative to using either x scores or |x| scores is that of working 
with the squared deviations. These will all be positive, and it also hap- 
pens that they provide an excellent starting place for the derivation of 
many other statistics. The squared deviations on the two tests are

Arithmetic    Spelling
    x₁²           x₂²
  42.90         45.26
  11.90          7.45
   2.10         32.83
  11.90         13.91
   2.10          0.07
   2.10          7.45
   2.40         22.37
  12.60         52.85
  19.80          1.61
   0.20          7.45
  20.70        410.87

Σx₁² = 128.70    Σx₂² = 602.12


The mean square deviation can be obtained by summing the squared 
deviations and dividing by the number of persons who took the test. This 
statistic is called the variance and is symbolized as σ²:

\[ \sigma^2 = \frac{\sum x^2}{N} \]

An even more useful statistic is obtained by taking the square root of the
variance, which is then called the standard deviation:

\[ \sigma = \sqrt{\frac{\sum x^2}{N}} \]


Applying the formula to the arithmetic and spelling tests, the following 
variances and standard deviations are found: 


Arithmetic               Spelling
σ² = 128.70/11           σ² = 602.12/11
σ² = 11.70               σ² = 54.74
σ  = 3.42                σ  = 7.40

Subscripts can be used with the variance and standard deviation formulas
to indicate which test is being studied, like using σ₁ to refer to the
standard deviation of the scores on a particular test.

The standard deviation and variance can be obtained without actually
going through the step of converting from raw to deviation scores, as
follows:

\[ \sigma^2 = \frac{\sum X^2}{N} - M^2 \qquad \sigma = \sqrt{\frac{\sum X^2}{N} - M^2} \]

Identical results will be obtained by either approach, showing here at an
early stage the mathematical maneuverability that is gained by working
with squared deviations. 
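
That the two approaches agree can be verified numerically. In the sketch below the raw scores are an assumption (recovered from the deviation scores and the means), and the raw-score computation uses one common form, the mean of the squared raw scores minus the square of the mean; other algebraically equivalent forms would serve as well.

```python
from math import isclose, sqrt

# Raw scores recovered from the deviation scores and means (an assumption).
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]


def variance_from_deviations(scores):
    """Mean of the squared deviation scores: sum(x^2) / N."""
    n = len(scores)
    m = sum(scores) / n
    return sum((score - m) ** 2 for score in scores) / n


def variance_from_raw(scores):
    """Raw-score form: sum(X^2) / N minus the squared mean."""
    n = len(scores)
    m = sum(scores) / n
    return sum(score ** 2 for score in scores) / n - m ** 2


for label, scores in (("Arithmetic", arithmetic), ("Spelling", spelling)):
    v_dev = variance_from_deviations(scores)
    v_raw = variance_from_raw(scores)
    # Variance and standard deviation by each route.
    print(label, round(v_dev, 2), round(sqrt(v_dev), 2),
          round(v_raw, 2), round(sqrt(v_raw), 2))
```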


THE EFFECT OF LINEAR TRANSFORMATIONS ON THE MEAN 
AND STANDARD DEVIATION 


If an arbitrary number were added to each of the raw scores in a dis- 
tribution, how would this change the mean and standard deviation? For 
example, we could add the number 5 to each of the arithmetic scores 
mentioned earlier. The new mean would then be 5 more than the original
mean. The proof of this is simple:

\[ M_X = \frac{\sum X}{N} \]

\[ M_{(X+C)} = \frac{\sum (X + C)}{N} = \frac{\sum X + \sum C}{N} = \frac{\sum X}{N} + \frac{NC}{N} = M_X + C \]

where C stands for any constant, such as 5
      M_(X+C) stands for the mean of the scores after the constant is
              added to each score in turn


If then the mean of the original scores is 10, and we add 5 to each of 
the scores, the new mean is 15. (The student should work out the simple 
proofs on his own. Not only will they prove relatively “painless”; but by 
thoroughly understanding the steps in the simple proofs, it will be easier 
to follow more complex statistical developments.)

Next we can determine what happens to the standard deviation when 
a constant is added to each of the scores. We could derive this from the 
raw score formula; but since we know that the standard deviation is the 
same whether it is determined from raw or deviation scores, it will be 
simpler to work with the deviation score formula. 

\[ \sigma_{(x+C)} = \sqrt{\frac{\sum [(x + C) - M_{(x+C)}]^2}{N}} \]

What the formula shows is that from each new score, which will be the
original deviation score plus the quantity C, is subtracted the mean of
the new scores. The mean of the original deviation scores is zero by
definition. Then the mean of the new scores is C, so we can rewrite the
formula as

\[ \sigma_{(x+C)} = \sqrt{\frac{\sum (x + C - C)^2}{N}} \]

which reduces to

\[ \sigma_{(x+C)} = \sqrt{\frac{\sum x^2}{N}} \]

This is the same formula used to obtain the standard deviation of the
original scores. We have then proved that the standard deviation is
unchanged if we add or subtract a constant from each of the scores. If 
then the standard deviation of a set of scores is 2, and we add 5 to each 
of them, the new standard deviation is also 2. 

Next we can see what will happen if each of the scores is multiplied by
a constant. The mean would then become

\[ M_{(CX)} = \frac{\sum CX}{N} \]

Because a constant term can always be moved to the left of a summation
sign,

\[ M_{(CX)} = \frac{C \sum X}{N} = C M_X \]

We have proved that if each score is multiplied by a constant, the mean
of the new scores equals the mean of the old scores times the constant. If
then the mean of the original scores is 10, and we multiply each score
by 3, the new mean is 30.

The effect on the standard deviation of multiplying each score by a
constant is as follows:

\[ \sigma_{(Cx)} = \sqrt{\frac{\sum (Cx)^2}{N}} = \sqrt{\frac{\sum C^2 x^2}{N}} = C \sqrt{\frac{\sum x^2}{N}} = C \sigma_x \]

The standard deviation of the new scores will be C times as large as the
old standard deviation. If then the standard deviation of the original
scores is 2, and we multiply each score by 3, the new standard deviation
is 6. The student should satisfy himself that the variance will be
multiplied by the square of the constant.

We have taken the time to derive these results for several reasons. The 
mean and standard deviation will be seen many times in the discussion of 
test results, and it is important to understand their properties. Knowing 
how these statistics are affected by linear transformations will allow us 
to make many computational short cuts without having to re-compute 
mean and standard deviation. Also, the simple proofs which are involved 
here demonstrate some of the manipulations which, when applied to 
more complex equations, permit us to derive many useful results. 
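
The student who prefers a numerical check to an algebraic one may try the following sketch. The scores are the arithmetic scores as recovered from the deviation table (an assumption), and the constants 5 and 3 are the ones used in the examples above.

```python
from math import isclose
from statistics import mean, pstdev, pvariance

# Arithmetic raw scores recovered from the deviation scores (an assumption).
scores = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]

ADDED = 5        # constant added to every score
MULTIPLIER = 3   # constant by which every score is multiplied

shifted = [score + ADDED for score in scores]
scaled = [MULTIPLIER * score for score in scores]

# Adding a constant moves the mean by that constant and leaves the SD alone.
print(mean(shifted) - mean(scores), pstdev(shifted) - pstdev(scores))

# Multiplying by a constant multiplies both the mean and the SD by it,
# and multiplies the variance by its square.
print(mean(scaled) / mean(scores), pstdev(scaled) / pstdev(scores),
      pvariance(scaled) / pvariance(scores))
```

Here `pstdev` and `pvariance` divide by N, as the formulas of this chapter do, rather than by N - 1.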


SCORE DISTRIBUTIONS 


The Normal Distribution. The standard deviation is most easily inter- 
preted when working with a normal distribution of scores. The normal 
distribution, or normal curve, as it is sometimes called, was mentioned in 
Chapter 2, but no fuller explanation was given at that time. The normal 
distribution refers to the way in which the score frequencies fall. Instead 
of listing test scores, as we did with the arithmetic and spelling tests, the 
same information could have been given by specifying the number of 
persons who obtained each score. Using the symbol f to stand for fre- 
quency, frequency distributions for the two tests are as follows: 


   Arithmetic          Spelling
   X₁      f           X₂      f
   22      1           75      1
   21      0
   20      1            .      .
   19      1           62      1
   18      0
   17      1            .      .
   16      0           56      1
   15      1           55      1
   14      3           54      0
   13      0           53      0
   12      2           52      3
   11      1           51      1
                       50      1
                       49      1
                       48      1

The frequency distributions are presented graphically in Figure 3-1.
There are not enough scores to indicate much about the distribution form
for either test. However, when there are many scores instead of just
eleven, the shape of the distribution begins to become apparent. If we
had given the arithmetic test to 200 students, the distribution of scores
in arithmetic might have looked like that in Figure 3-2. In Figure 3-2 it
can be seen that most of the scores are clustered around 16, and that
scores become fewer and fewer as we go either toward high scores or
toward low scores.

Figure 3-1. Frequency distributions for arithmetic and spelling tests.

Figure 3-2. Hypothetical frequency distribution of arithmetic scores for 200 students.

As the scores in Figure 3-2 are drawn, they resemble the normal
distribution. The standard deviation tells the number of persons whose
scores fall in different regions of the normal curve. Figure 3-3 shows
that approximately 68 per cent of the persons make scores between plus
one and minus one standard deviations and 95 per cent of the scores
fall between plus two and minus two standard deviations. A detailed
breakdown of the percentages of scores in various regions of the normal
distribution is presented in Appendix 3. It will prove useful in interpreting
the many statistics based on the normal distribution.

Figure 3-3. Per cents* of subjects in various regions of the normal
distribution. (Vertical axis: frequency; horizontal axis: standard
deviation, from -3.00 to 3.00.)

* The per cents add up to 99.6 instead of 100 because a fraction of 1 per
cent of the cases lies above and below three standard deviations.
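
The percentages for any region of the normal curve can also be computed directly. The following sketch uses the error function from Python's standard `math` module; the helper names `normal_cdf` and `proportion_between` are illustrative and are not drawn from the text or its appendix. Values computed this way may differ slightly from the rounded per cents printed in a figure.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Proportion of a normal distribution lying below standard score z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def proportion_between(low, high):
    """Proportion of cases between two standard scores."""
    return normal_cdf(high) - normal_cdf(low)


# Proportions within one, two, and three standard deviations of the mean.
for limit in (1, 2, 3):
    print(limit, round(100 * proportion_between(-limit, limit), 1))
```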

Standard Scores. A very useful type of score can be obtained by divid- 
ing an individual’s deviation score on a test by the standard deviation of 
the scores:

\[ z = \frac{x}{\sigma} \quad \text{or} \quad z = \frac{X - M}{\sigma} \]

where z symbolizes the standard score.
Applying the formula to Johnny’s arithmetic score we find

\[ z = \frac{22 - 15.45}{3.42} = \frac{6.55}{3.42} = 1.92 \]

An individual’s standard score indicates his place in the frequency
distribution (if the distribution of scores is approximately normal). If a
person has a standard score of 2.1, this indicates that his score is more than
two standard deviations above the mean, and that about 98 per cent of the
subjects made lower scores. Similarly, a person who makes a standard score
of less than -1.0 is lower than 84 per cent of the subjects.

When the raw scores on different tests are converted to standard 
scores, they have the same mean, which is zero in all cases. What is the 
standard deviation of a set of standard scores? Because of the way in 
which standard scores are obtained, the standard deviation of any set of 
standard scores is 1.00. 

Converting raw scores to standard scores makes results comparable 
from test to test. If an individual makes a standard score which is 
positive on two tests, then we know that he did better than average on 
both tests. If he makes a standard score of 1.00 on one test and 1.50 on 
another, we know that he performed better on the second test. 
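
The two properties just stated, a mean of zero and a standard deviation of 1.00, can be confirmed with a few lines of code. This is a sketch under the assumption that the raw scores are those recovered from the deviation table; `standard_scores` is an illustrative name.

```python
from statistics import mean, pstdev

# Raw scores recovered from the deviation scores and means (an assumption);
# Johnny is the first entry in each list.
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]


def standard_scores(scores):
    """Return z = (X - M) / sigma for each raw score."""
    m = mean(scores)
    sigma = pstdev(scores)  # divides by N, as in this chapter
    return [(score - m) / sigma for score in scores]


z1 = standard_scores(arithmetic)
z2 = standard_scores(spelling)

# Johnny's standard scores on the two tests.
print(round(z1[0], 2), round(z2[0], 2))

# Every set of standard scores has mean 0 and standard deviation 1.
print(round(mean(z1), 6), round(pstdev(z1), 6))
```

Because the program does not round intermediate values, its results can differ in the last decimal place from a hand computation based on rounded figures.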

Transformed Standard Scores. For practical purposes, it is often useful 
to express test scores by a modification of z scores. A teacher might have 
difficulty in explaining to a student’s mother that a standard score of 
zero meant average performance rather than zero performance. Also, the 
negative values which are obtained with standard scores are difficult for 
some persons to understand. Starting with a set of standard scores, and 
knowing what we do about transformations, we can derive a new set of 
scores with any mean and standard deviation that we like. If we desire 
to have the scores in such a form that the mean is 100 and the standard 
deviation is 10, the first step is to multiply all the standard scores by 10. 
Then 100 is added to each of the resulting scores. If it is desired to
compare the scores on two tests which have different means and standard
deviations, the mean and standard deviation of one test can be made 
the same as the other. It proves much easier to make transformations 
directly on raw scores rather than having to compute standard scores 
first. A formula for this is given in Appendix 2. 
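
As an illustration, the sketch below carries out the transformation in both ways. The direct route uses the linear relation X' = (σ'/σ)X + [M' - (σ'/σ)M], which follows from the results on linear transformations; whether this is the exact form printed in the appendix is an assumption. The function names and the raw scores (recovered from the deviation table) are likewise assumptions.

```python
from math import isclose
from statistics import mean, pstdev

# Arithmetic raw scores recovered from the deviation scores (an assumption).
scores = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]


def rescale_via_standard_scores(raw, new_mean, new_sd):
    """Convert to z scores first, then multiply by new_sd and add new_mean."""
    m, sigma = mean(raw), pstdev(raw)
    return [new_sd * ((x - m) / sigma) + new_mean for x in raw]


def rescale_directly(raw, new_mean, new_sd):
    """Apply one linear transformation to the raw scores themselves."""
    m, sigma = mean(raw), pstdev(raw)
    slope = new_sd / sigma
    intercept = new_mean - slope * m
    return [slope * x + intercept for x in raw]


two_step = rescale_via_standard_scores(scores, 100, 10)
one_step = rescale_directly(scores, 100, 10)

# Both routes give a mean of 100 and a standard deviation of 10.
print(round(mean(one_step), 2), round(pstdev(one_step), 2))
print([round(value, 1) for value in one_step])
```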

Rank-order Scores. In Chapter 1, it was said that psychological measures
are sometimes expressed in the form of ranks. Ranks are often used
to describe test scores when the number of subjects is not large. If a
mother is told that her daughter made the fifth highest score in a class of
eleven, she can easily understand how well the daughter performed in
comparison to the other children.

The scores on the arithmetic and spelling tests can be expressed as 
ranks. Because Johnny made the highest score on the arithmetic test, his
score would become rank 1. Eric would have rank 2, Sharon would have
rank 3, and so on. When several persons make the same score, each
individual is given the average of the ranks for the group. There are three
persons who have scores of 14 in arithmetic. They must share the ranks 
of 6, 7, and 8. The mean of these three ranks is 7, and consequently all 
three persons would be given a rank of 7. There are two people with 
scores of 12, and they must share the ranks of 9 and 10. They are each 
given a rank of 9.5. The lowest score, that made by Harry, is then given 
a rank of 11. 
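
The rule for tied scores can be sketched in code as follows. The function name `average_ranks` is illustrative, and the raw scores are an assumption (recovered from the deviation table, with Johnny first and Eric last).

```python
# Arithmetic raw scores recovered from the deviation scores (an assumption).
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]


def average_ranks(scores):
    """Rank scores from highest (rank 1) down, sharing ranks among ties."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1   # best rank occupied by this score
        count = ordered.count(value)       # number of persons tied at it
        # Mean of the ranks first, first + 1, ..., first + count - 1.
        rank_of[value] = first + (count - 1) / 2
    return [rank_of[score] for score in scores]


ranks = average_ranks(arithmetic)
print(ranks)
```

The three scores of 14 receive the rank 7, and the two scores of 12 receive the rank 9.5, in agreement with the hand computation.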

Percentile Ranks. When the number of scores is large, as when we are treating
200 scores instead of 11, it proves useful to make a conversion of
ranks. One useful conversion is to express each individual’s score in 
terms of the per cent of scores which are lower. If an individual ranks 
fourth among 200 persons, then 196 persons rank lower. Dividing 196 by 
200, we find that the individual has a percentile score of 98. Because sev- 
eral or more persons will usually make the same score, a modification of 
the basic formula is often used to take account of tied ranks. The formula 
is to divide the number of people below a particular score, plus half the 
number of people who make the score, by the total number of subjects. 
Scores obtained in this way are referred to as “midpoint percentile ranks.” 
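
Both the basic formula and the midpoint modification can be written out as a short sketch. The function names are illustrative; the group of 200 distinct scores is hypothetical and serves only to reproduce the example of the person who ranks fourth, while the eleven arithmetic scores are an assumption recovered from the deviation table.

```python
def percentile_rank(score, scores):
    """Per cent of the scores that are lower than the given score."""
    below = sum(1 for x in scores if x < score)
    return 100 * below / len(scores)


def midpoint_percentile_rank(score, scores):
    """Per cent below the score, counting half of those who tie with it."""
    below = sum(1 for x in scores if x < score)
    same = sum(1 for x in scores if x == score)
    return 100 * (below + same / 2) / len(scores)


# Hypothetical group of 200 persons with distinct scores 1 to 200;
# the person who ranks fourth from the top has the score 197.
group = list(range(1, 201))
print(percentile_rank(197, group))

# The three tied arithmetic scores of 14 (raw scores are an assumption).
arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
print(round(midpoint_percentile_rank(14, arithmetic), 1))
```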

Percentile scores, like ranks, are useful in explaining test results to in- 
dividuals who have little background in the statistics of testing. However, 
they are awkward for other statistical manipulations that might be 
needed. Percentiles provide no indication of the distance between scores. 
There might be a wide gap in raw scores between the person who has 
a percentile score of 70 and the person who has a percentile score of 71, 
but this would not be seen from the percentile scores alone. Percentile 
scores underemphasize the differences between scores on the extremes 
of the distribution. Looking back at Figure 3-3, you can see that only 
about 2 per cent of the cases fall between standard scores of 2.00 and 
3.00. These standard scores would be equal to percentile ranks of about
97 and 99. Although the scores appear to be separated by only a trivial 
amount when expressed in terms of percentiles, we can see that in terms 
of standard scores there is a wide separation. 


Figure 3-4. Frequency distribution of percentile scores.

What is the distribution form of a set of percentile scores? An equal
per cent of the subjects make scores between the 99th and 90th percentiles
as between the 89th and 80th percentiles, and so on. Consequently, the
frequency distribution is flat, or rectangular as it is called (see Figure
3-4).


NORMS 


Although standard scores or percentile scores would give a good in- 
dication of where Johnny stands in relation to his 10 classmates, it would 
still be a question as to how well Johnny performs in relation to the 
students in other schools. It may be that Johnny is in a particularly bright 
class, in which average performance means high performance in com- 
parison to other classes. It might be that Johnny goes to school in a 
“well-to-do” neighborhood where the children as a group are well above 
the community average. To dramatize the specificity of scores in a par- 
ticular setting, we can imagine the problems involved in testing a class 
composed of geniuses. Because of the mechanics of obtaining standard 
scores and percentiles, half the students would inevitably come out “be- 
low average.” But, of course, below average performance for geniuses 
would mean superior performance in a group of less gifted children. 

Normative Populations. In order to understand how well a child per- 
forms on a test, it is often necessary to compare his score with the scores 
made by children in previous years and the scores made by children in 
a range of localities. In order to interpret the test score made by an 
applicant for a particular branch of the Armed Forces, it would be neces- 
sary to compare his score with the scores of many other applicants. The 
larger group to which an individual is compared is called a normative 
population. 

The collection of people which constitutes a normative population is 
determined by the use to which the scores will be put. If achievement 
test scores are being used to place students in special high school study 
programs in a particular city, the normative population should contain 
students from all of the grammar schools in the city. If a selection test
is being used to select individuals for pilot training, the
normative population should contain the individuals who have
previously applied for pilot training.

On some occasions, the normative population must be quite large.
When selecting students for college, the normative population consists,
at least hypothetically, of all the graduating high school students in the
United States. Later we will see that in order to interpret the scores on a
particular “intelligence” test, the test was given to several thousand
children in all parts of the country. In other testing situations, the normative
population may be quite small. To interpret the results of a course
examination, the teacher may need only to make comparisons with the
scores made by several previous classes.

Scoring in Respect to Norms. It is often more meaningful to compare 
an individual's test score with the scores obtained from a normative popu- 
lation rather than with some smaller and less representative group. In 
this case, an individual’s standard score would be determined by a 
comparison with the mean and standard deviation of the normative popu- 
lation. Similarly, percentile scores would be determined by the scores 
found in the normative population. Instead of going through the labor of 
actually computing such scores for each new individual, it is easier to 
compose a table which allows a direct conversion of raw scores to stand- 
ard scores or percentile scores. 

Sampling of Scores. The group of persons actually tested is usually 
small in comparison with the total number of persons in the normative 
population. Seldom will there be test scores for all of the individuals who 
could conceivably be included. Even several thousand children is a small 
group in comparison with the many millions in this country. The norms 
obtained from a sample are then only estimates of the norms which 
would be obtained from the entire normative population. 

Sampling Bias. When choosing a sample for constructing norms, it is 
important to select a group which is as representative as possible of the 
normative population. The sample would be biased if all of the tests 
were given at one school, because, as we said earlier, the particular 
school might be above or below average for the community as a whole. 
If the sample is meant to represent all of the children in the United 
States, it must be ensured that there is a balance of children from all 
sections of the country, that there is a proportionate representation of 
urban and rural children, and that all races and ethnic groups are in- 
cluded. 

One way of ensuring a representative sample is to choose the subjects 
randomly from the normative population. This might be feasible in 
establishing norms for the performance of students in one city. It would 
be possible to randomly select, say, 500 of the children as a sample. On 
any larger scale it is not feasible to gather a completely random sample. 
It is at least conceivable that a sample of the children in the United 
States could be obtained by selecting randomly from all of the millions 
of children. But this would be prohibitively expensive and time con- 
suming. It might be necessary to go several hundred miles just to test 
one subject who, because of the sampling procedure, was chosen from 
a remote region. When, as is usually the case, it is not feasible to use a 
completely random sample, there are alternative sampling methods 
which can be applied. Some of these will be discussed in Chapter 13.
Sampling Error. Regardless of how representative the sample, results
obtained from a sample can only estimate the true norms in the popula- 
tion. If we find the mean and standard deviation of a test in a sample, 
it must be considered how much these estimates are likely to be in error. 
As in all sampling problems, the larger the number of cases (more items 
in the sampling of content and more persons in the sampling of scores) 
the less the estimates will be in error. 

The Sampling Distribution of the Mean. In an ideal situation, it is 
possible to determine the error that will be involved in making estimates 
from a sample. In this situation, it would be necessary to know in ad- 
vance the statistics, here the mean and standard deviation, which would 
be obtained in the entire population. This would be the case where tests 
were given to all of the subjects in a population. Samples of the tests 
could be drawn randomly just to see how much the mean and standard 
deviations of the samples would depart from the actual mean and stand- 
ard deviations which are known to exist in the population. We could draw 
out samples of different sizes to see how much more error is involved in 
the ones which have a relatively small number of people. 

If we drew 100 samples with 50 test scores in each, the sample means
could be plotted in a frequency distribution and compared with the
population mean. We would find a distribution resembling that in
Figure 3-5. In Figure 3-5 we see that the distribution of means is like


Figure 3-5. Distribution of sample means about the population mean.

the population mean. However, because each sample now contains 200
instead of 50 persons, we would expect to find the distribution of means 
clustered more tightly about the population mean. Because of the greater 
stability obtained from a sampling of 200 persons, it would be surprising 
to find sample means which deviated by as much as 2 score points from 
the population mean. 

In the same way that a standard deviation can be computed for any 
group of scores, a distribution of means can also provide a standard de- 
viation. We would expect the standard deviation of the means for the 
samples of 50 persons to be larger than the standard deviation of the 
means for the samples of 200 persons. In fact it would be about twice as 
large. 

If, as in our ideal example here, we knew the mean and standard de- 
viation of the population, we could determine what would be the stand- 
ard deviation of the means for a particular sample size, without going to 
the trouble of drawing samples and computing means. The standard 
deviation of the means, usually called the standard error of the mean, 
would be very useful in specifying the accuracy of norms. For example,
if we know that the population mean is 25 and the standard error of the 
mean is 1, then we can tell how rarely means of a particular size could 
be obtained from samples. Working backward, if we know the standard 
error of the mean for a particular sample size and know the mean of 
one sample, we can state the likelihood with which the population mean 
will differ from the sample mean by any number of score units. 
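
The ideal situation described here can be imitated by simulation. The sketch below is not a procedure given in the text: the population of 20,000 normally distributed scores, its mean of 25 and standard deviation of 10, the fixed random seed, and the use of 2,000 samples (rather than 100, to give a steadier estimate) are all assumptions made for illustration. The comparison value, the population standard deviation divided by the square root of the sample size, is the standard result for the standard error of the mean.

```python
import random
from math import sqrt

rng = random.Random(0)  # fixed seed so that the run is repeatable

# Hypothetical population: 20,000 scores, mean near 25, SD near 10.
population = [rng.gauss(25, 10) for _ in range(20000)]
pop_mean = sum(population) / len(population)
pop_sd = sqrt(sum((x - pop_mean) ** 2 for x in population) / len(population))


def sd_of_sample_means(sample_size, draws=2000):
    """Draw many random samples and return the SD of their means."""
    means = []
    for _ in range(draws):
        sample = rng.sample(population, sample_size)
        means.append(sum(sample) / sample_size)
    grand_mean = sum(means) / draws
    return sqrt(sum((m - grand_mean) ** 2 for m in means) / draws)


se_50 = sd_of_sample_means(50)
se_200 = sd_of_sample_means(200)

# Observed SD of the means beside the theoretical standard error.
print(round(se_50, 2), round(pop_sd / sqrt(50), 2))
print(round(se_200, 2), round(pop_sd / sqrt(200), 2))

# Ratio of the two observed values; theory gives sqrt(200 / 50) = 2.
print(round(se_50 / se_200, 2))
```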

Unfortunately, in nearly all practical work with norms, only the mean 
and standard deviation can be obtained for one sample. There are statis- 
tical techniques for working backward from sample values to the standard 
error of the mean. A standard error can also be obtained for the standard 
deviation, to determine the extent to which the standard deviation ob- 
tained from a sample is likely to be divergent from the population 
standard deviation. To explain these procedures in detail would take 
us far afield from the central topics of tests and measurements. The pur- 
pose here is to make the reader aware of the existence of sampling error 
and its influence on norms. There are numerous books devoted primarily 
to sampling error (see 1, 2, 3), where detailed procedures are given 
for estimating the accuracy of norms. 

Age Norms. We discussed previously how percentile scores and stand- 
ard scores could be used to express norms. In some situations it is desir- 
able to express norms in terms of children’s ages. One such set of norms 
could be obtained by testing the vocabulary of children at all ages from 
four to twelve years. For this purpose, a list of 100 words varying in
difficulty could be used. The mean score could be obtained for each
age group separately. A graphic plot of these might look like that in
Figure 3-6. Figure 3-6 shows, for example, that seven-year-old children
have a larger vocabulary than five- and six-year-old children, as would
be expected.

Figure 3-6. Mean vocabulary scores of children at each age from four to
twelve. (Vertical axis: score, 10 to 90; horizontal axis: age, 4 to 12.)

Age norms would prove useful in interpreting the vocabulary
score made by a particular child. Suppose that he is seven years
old and has a vocabulary score of 25. The mean score for the
seven-year-olds in the normative sample is 30. We can then look to find the
age group which corresponds to a score of 25. Although there is no
age group whose mean falls exactly at this point, we can work as though
the curve were continuous throughout all age levels and determine the
fractional age group that would correspond to a particular score.
Reading from Figure 3-6, it can be seen that an age of approximately
6.5 would correspond to a score of 25. It then can be said that the
seven-year-old boy has the vocabulary of a six and one-half year old.
Age norms fit in well with the way in which we customarily think about
the progress of children, and they are, therefore, very useful in the
interpretation of test scores.
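
Reading a fractional age from the curve amounts to interpolating between the means of adjacent age groups. The sketch below assumes straight-line interpolation and a hypothetical table of norms: only the mean of 30 for seven-year-olds comes from the example, and the remaining values are invented so that a score of 25 falls at an age of 6.5. The function name `age_equivalent` is likewise illustrative.

```python
# Hypothetical mean vocabulary scores by age. Only the value for age 7
# is taken from the example; the others are invented for illustration.
norms = {4: 8, 5: 13, 6: 20, 7: 30, 8: 41, 9: 52, 10: 63, 11: 73, 12: 82}


def age_equivalent(score, age_norms):
    """Interpolate linearly to find the age whose mean score equals `score`.

    Assumes the mean scores rise strictly from each age to the next.
    """
    ages = sorted(age_norms)
    for younger, older in zip(ages, ages[1:]):
        low, high = age_norms[younger], age_norms[older]
        if low <= score <= high:
            fraction = (score - low) / (high - low)
            return younger + fraction * (older - younger)
    raise ValueError("score lies outside the range of the norms")


# The seven-year-old with a vocabulary score of 25.
print(age_equivalent(25, norms))
```
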

Mental Age. Age norms are employed with some of the “general
intelligence” tests, such as the Stanford-Binet. By comparing a child’s score
with age norms, it can be determined whether he is more or less
“intelligent” than the average child of his age. A score which is compared
in this way with age norms is referred to as a mental age. If a six-year-old
child performs as well as the average seven-year-old child, he is said
to have a mental age of seven. Similarly, the six-year-old who performs
only at the level of the average five-year-old would be said to have a
mental age of five.

A child’s mental age is partly a function of the test which is employed.

Grade Norms. A set of norms which is very similar to age norms can
be obtained by finding the mean scores for children in various grade 
levels. If a standard spelling test were given to children in the fourth 
through eighth grades, the means could be plotted by grades much as 
they were plotted by ages in Figure 3-6. Comparing an individual's score 
with grade norms gives the teacher an indication of how well the pupil 
is progressing. 

Quotient Scores as Norms. By dividing an individual's mental age by 
his chronological age, a score is obtained which is called an intelligence 
quotient, or as it has come to be known, the IQ. If an eight-year-old child 
does as well on an intelligence test as the average ten-year-old, then he 
would have an IQ of 125. The formula for obtaining the IQ is then 


\[ IQ = \frac{MA}{CA} \times 100 \]


where MA stands for mental age 
CA stands for chronological age 
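
The quotient is simple enough to compute by hand, but a brief sketch may make the formula concrete. The function name `ratio_iq` is illustrative; the first example is the eight-year-old discussed above, and the other two apply the same formula to the six-year-olds with mental ages of seven and five.

```python
def ratio_iq(mental_age, chronological_age):
    """IQ as the ratio of mental age to chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100 * mental_age / chronological_age


# An eight-year-old who performs like the average ten-year-old.
print(ratio_iq(10, 8))

# Six-year-olds with mental ages of seven and of five.
print(round(ratio_iq(7, 6), 1), round(ratio_iq(5, 6), 1))
```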


In spite of the wide use and popularity of the IQ there are a number
of misleading connotations which it bears. Because the IQ is a ratio, a
ratio of MA to CA, it suggests that a ratio scale is available for
intelligence. Not only is it questionable whether or not the term intelligence
should be used with the available tests, but it is certainly unwise to con- 
sider IQ scores as constituting a ratio scale. How unwise this is becomes 
evident when we consider the IQ of a child who got none of the items 
correct on the test. Such a child would certainly be retarded, but it makes 
no sense to say that his intelligence is zero. 

The IQ is misleading when applied to adults. It has been found that an 
individual's score on an intelligence test grows until the late teens and 
then levels off. If a twenty-year-old individual has an IQ of 150, this 
cannot be interpreted to mean that he has a level of ability equal to that 
of the average thirty-year-old. 

The IQ is most correctly considered as a transformed standard score. 
The mean IQ on the Stanford-Binet is approximately 100 (as it should
be if the test is properly standardized) and the average standard deviation
is approximately 16. This is the same distribution that would be
obtained by multiplying a set of standard scores by 16 and then adding
100 to each. However, if the IQ is to be construed as a transformed
standard score, it is necessary that the standard deviation be the same 
for all tests which use such scores. This is unfortunately not the case. 
In other words, if one person makes an IQ of 120 on Test A, and another 
individual makes an IQ of 115 on Test B, it is entirely possible that the 
latter individual has a higher standard score (due to a difference in 
standard deviations of IQ’s on the two tests). 
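The comparison of Test A and Test B can be worked through as a rough sketch. The standard deviations used below (24 for Test A, 16 for Test B) are hypothetical values chosen only to illustrate the point; the text does not give them, and the function names are likewise assumptions.

```python
def iq_to_standard_score(iq: float, sd: float, mean: float = 100.0) -> float:
    """Convert an IQ back to a standard score, given the test's mean and SD."""
    return (iq - mean) / sd


def standard_score_to_iq(z: float, sd: float = 16.0, mean: float = 100.0) -> float:
    """Transform a standard score into an IQ: multiply by the SD, then add the mean."""
    return mean + sd * z


# Hypothetical standard deviations, assumed for illustration only.
z_on_test_a = iq_to_standard_score(120, sd=24.0)
z_on_test_b = iq_to_standard_score(115, sd=16.0)

# The person with the lower IQ (115) has the higher standard score.
print(z_on_test_a < z_on_test_b)  # True
```

Under these assumed values the IQ of 115 lies farther above the mean, in standard deviation units, than the IQ of 120.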
GRADING COURSE EXAMINATIONS


“Grading on the Curve.” There has developed an erroneous impression 
that the characteristics of the normal distribution are such as to suggest 
the grades which should be given on course examinations. One way to
do this is to give the people who score more than 1.5 standard deviations
above the mean, about 7 per cent of the group, grades of A. This logic can
be carried further to fail
the bottom 7 per cent, or some other per cent corresponding to a section 
of the curve, and to allot the other grades in terms of the characteristics 
of the normal distribution. There is nothing about the normal distribution 
or any other statistical distribution to say what grades should be given. 
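The figure of "about 7 per cent" can be checked with a short sketch using the NormalDist class from Python's standard library. The helper name share_above is an assumption made for this example. The calculation shows only how much of the curve lies beyond a cutting point; it says nothing about where the cutting point for a grade ought to be placed.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_above(z: float) -> float:
    """Proportion of a normal distribution lying more than z SDs above the mean."""
    return 1.0 - standard_normal.cdf(z)


top_share = share_above(1.5)
print(f"{top_share:.1%}")  # 6.7%
```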

The reason that instructors sometimes “grade on the curve” is that it 
simplifies the allotment of grades. The instructor need only allot grades 
to the predetermined percentages of persons and the job is done. In 
some situations this can be justified on the basis that the school, or 
other institution, can retain only a portion of the entrants. It is then 
necessary to fail a certain per cent of the students. Otherwise there is 
little to recommend the procedure. This system of grading takes no 
account of the over-all level of performance of the class. If one class 
performs better than another, this will not affect the percentage of in- 
dividuals who make each grade. The instructor who does his job well, 
who encourages all of the students to do their best, and who is especially 
adept at getting over the subject matter can bring the level of the 
majority of the students above what they would attain in a less fortunate 
class setting. It is then unfair to “grade on the curve” and give only a 
predetermined per cent of grades in each category. 

“Grading on the curve” encourages an unhealthy set of attitudes among 
students. It makes the student hope that some members of the class will 
do poorly, so that his own grade will look good in comparison. It also 
makes the superior student something of an outcast, because his superior 
performance will make the remainder of the scores fall lower on the 
“curve.” There is no substitute for the instructor’s careful judgment about 
the educational goals of his course and the levels of performance which 
merit superior and inferior grades.

The Influence of Guessing. If a test employs a number of alternative 
answers for each question, an individual will probably get some of the 
questions correct even though he knows nothing about the subject matter. 
If, on a true-false test, the individual were blindfolded and answered the 
questions, or if he flipped a coin to determine the answers, he would 
probably get about half of them correct. The story is told of the recruit 
for the Armed Forces who got none of the items correct on a long multiple 
choice test. How could this happen? The most reasonable answer is that 
he “cheated.” He knew all of the answers and marked incorrect alterna-
tives on purpose. 

By making several assumptions, some formulas can be obtained to 
correct for guessing. Let n stand for the number of alternatives on 
each question. In a true-false test, n equals 2; if there are three alterna- 
tives, n equals 3, and so on. Then, if an individual guesses on an item, 
the probability, p, is 1/n that he will get the item correct. The probability 
that an individual will get the item incorrect, q, is (n - 1)/n. We see
that p plus q equals 1. Then it follows that guessing influences test 
scores as shown by the equation below: 


\[ R = R_c + p(T - R_c) \]


where R_c stands for the number of items that he “actually” knows
      R stands for the number which he gets correct (knowledge plus
        guessing)
      T stands for the number of questions which he answers

What the formula assumes is that the score which an individual makes
is a function of the number which he actually knows, R_c, and the number
on which he guesses. An experimental approximation to R_c could be made
by using many alternatives for each item. As the number of alternatives
is made larger and larger, guessing has less and less effect on R.
Some manipulations of the formula will show how corrections for 
guessing can be obtained. 


\[
\begin{aligned}
R &= R_c + pT - pR_c \\
R &= pT + (1 - p)R_c \\
R &= pT + qR_c \\
qR_c &= R - pT \\
R_c &= \frac{R - pT}{q}
\end{aligned}
\]


This will allow an estimate of the score that an individual would have 
made had guessing not been an influence; i.e., if there had been an infinite 
number of alternatives for each question. 

Some examples will help to show how the correction for guessing 
works. Imagine that we are studying the scores made by students on a 
40-item test, where 5 alternatives are being used for each item. Then p 
would equal .2 and q would equal .8. Suppose that John tries 30 items 
and gets 22 correct. Then the correction for guessing would be 


\[ R_c = \frac{22 - .2 \times 30}{.8} = 20 \]

If Bill tries all 40 items and gets 24 correct, the correction would be

\[ R_c = \frac{24 - .2 \times 40}{.8} = 20 \]


Both of the boys “knew” twenty of the questions. John guessed at ten 
of the items and got the expected percentage correct, achieving a score 
of 22. Bill was more willing to guess. Because he guessed on 20 items, 
he obtained a score of 24. The example illustrates the situation in which 
the correction for guessing is most necessary, when the subjects are 
allowed to attempt as many items as they choose. If a test is constructed 
on that basis, the correction for guessing should be applied to the 
scores. However, there is usually little reason for allowing subjects to 
answer only as many questions as they choose. It is much easier to in- 
struct them to answer all questions whether they know the answers or 
not. Then T becomes the same for all students; and if one student gets 
more items correct than another, he will maintain a higher score after 
the correction is made. 
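The correction can be sketched as a small Python function and applied to the two cases just discussed. The function and argument names are assumptions made for the example. The code uses the algebraically equivalent form (nR - T)/(n - 1), obtained by substituting p = 1/n and q = (n - 1)/n into the formula, so that the arithmetic stays exact for whole-number inputs.

```python
def correct_for_guessing(num_right: int, num_attempted: int, num_alternatives: int) -> float:
    """Estimate the number of items 'actually' known: (R - pT) / q."""
    if num_alternatives < 2:
        raise ValueError("each item needs at least two alternatives")
    if not 0 <= num_right <= num_attempted:
        raise ValueError("number right must lie between 0 and the number attempted")
    # With p = 1/n and q = (n - 1)/n, (R - pT) / q simplifies to (n*R - T) / (n - 1).
    return (num_alternatives * num_right - num_attempted) / (num_alternatives - 1)


john = correct_for_guessing(num_right=22, num_attempted=30, num_alternatives=5)
bill = correct_for_guessing(num_right=24, num_attempted=40, num_alternatives=5)
print(john, bill)  # 20.0 20.0
```

When every student attempts the same number of items, the corrected score rises and falls with the raw score, which is why the correction cannot change the order of the students in that case.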

It should be remembered that the correction for guessing is only an
estimate based on the expected effect of guessing. If an individual flipped
a coin to decide the answers on a true-false test, the expectation is that 
he will get 50 per cent correct. However, this would be so only on the 
average. Luck might have it that the individual would get all of the items 
correct or none of them correct. The effect of guessing is then not only
to distort the scores, overestimating the scores of the persons who know
the least, but also to introduce a certain amount of randomness, or “error,”
into the scores. This and other sources of error in test scores will be
discussed in Chapter 6. 
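The randomness introduced by guessing can be made concrete with a brief sketch. It assumes, purely for illustration, a ten-item true-false test on which every answer is decided by a coin flip, and it treats the items as independent so that the number right follows the binomial distribution. The test length and the function name are assumptions, not taken from the text.

```python
from math import comb, sqrt


def guessing_distribution(num_items: int, num_alternatives: int = 2) -> list[float]:
    """Probability of each possible number right when every answer is a blind guess."""
    p = 1 / num_alternatives
    q = 1 - p
    return [comb(num_items, k) * p**k * q ** (num_items - k) for k in range(num_items + 1)]


probabilities = guessing_distribution(num_items=10)

# Expected number right, and the standard deviation around that expectation.
expected_right = sum(k * pr for k, pr in enumerate(probabilities))
spread = sqrt(sum((k - expected_right) ** 2 * pr for k, pr in enumerate(probabilities)))

print(expected_right)     # half of the ten items, on the average
print(spread)             # typical departure from that average
print(probabilities[10])  # chance that luck alone yields a perfect score
```

On such a test the expectation is five items right, but a perfect score or a score of zero each remains possible with about 1 chance in 1,024.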
CHAPTER 4


The Evaluation of Psychological Measures 


In these times, it would be difficult to find someone who has not had 
first-hand experience with psychological measures. Course examinations 
in school are examples of psychological measures. The aptitude tests 
which you might have taken in applying for college, for a branch of the 
Armed Forces, or for a particular job are psychological measures. You
might have taken personality inventories which classify you as an “extra- 
vert” or “introvert,” or seen ink blot and other less direct personality tests. 
You have probably taken sensory tests of various kinds, for example, the 
test for acuity of vision that is required in many states for a driver’s
license. Perhaps you have participated in psychological experiments in
which you were required to place pegs in a peg board, react as quickly 
as possible to a light signal, or solve a problem in the shortest possible 
time. These are all within the area of psychological measurement. 

Testing the Test. Tests should be held suspect until their worth is 
proved. Tests that purport to measure intelligence, anxiety, and social
adjustment may not measure those characteristics at all. If there are a 
number of ways in which a measure can be obtained, a method is needed 
for deciding the relative merits of the different approaches. This takes 
us into the subject of test validation. 

Psychological measures differ markedly in the extent to which their 
usefulness is self-evident. On one extreme, tests of auditory acuity have 
a relatively direct meaning and importance. If the tests indicate that a 
person has very low acuity, he would probably fail in a job that required 
good hearing. Also, it is a safe bet that this person would have difficulty 
in receiving phone messages, in working as a telegrapher, and in trying 
to understand radio programs. 

At the opposite extreme, where the meaning of the test results is not 
at all self-evident, are some of the personality tests. What does it mean 
when an individual describes an ink blot as looking like a butterfly? In 
this case it is necessary to show a connection between the test response 
and what the individual does in daily life. It is necessary to show that 
the test predicts something, in the broad sense of the word. 
Starting with the distinction between measures that stand for themselves
and those which must prove their ability to predict, the outline
in Table 4-1 shows how different kinds of psychological measures are 
constructed and appraised. 


TABLE 4-1. OUTLINE OF VALIDATION PROCEDURES FOR THE USE OF PREDICTORS
AND ASSESSMENTS

                      Predictors                   Assessments

Purpose               Prediction of behavior       Measurement of behavior

Standard of validity  Demonstration of             Demonstration of
                      predictive power             representativeness

Validation procedure  Correlation with an          Sampling of content
                      assessment (a criterion)

Construction          Hypothesis; collection of    Definition:
                      items; finding the best      a. Actions to be measured
                      items                        b. Persons who stand high
                                                      and low in the measure

Examples              Aptitude tests               Achievement tests
                      Personality tests            School examinations
                      General intelligence tests   Job ratings
                      Special ability tests:       Work records
                        musical aptitude, art      Work samples
                        judgment, manual           All criteria for evaluating
                        dexterity, etc.              predictors

                                  Attitude tests
                                  Interest tests
                                  Opinion surveys


The remainder of the book will be largely devoted to filling in the outline
in detail.


EVALUATION OF ASSESSMENTS 


Assessments¹ are measures in their own right. If the instructor in a
particular course gives you a final grade of A, then you have been
assessed, or measured, in your course achievement. When an employer
promotes one of his salesmen, the salesman has been assessed in regard
to his performance. When the manager of a baseball team retains some
of his players and sends others to a minor league, the players have been
assessed by the manager. In each of these cases, the measure of perform-
ance is determined by someone who has the right, or the responsibility,
to decide what is good and bad in the particular situation. The course
instructor decides that a test of a particular form will best measure what
he wants to be measured. Perhaps the salesman is promoted because he
sells more articles than other salesmen. The baseball players might be
retained in terms of a statistical complex of batting average, number of
errors, runs batted in, or whatever else the manager chooses to consider.
An assessment has an immediate importance, at least to the person who
is being assessed and who stands to gain or lose from the outcome.

¹ The reader should be careful to distinguish the meaning of the term “assessment”
as it is used here from the misleading use of the term that often appears in the
psychological literature. For example, the term “personality assessment” is often used
when the instruments themselves are really predictor tests. Some psychologists use the
term “content validity” to refer to the procedures which are described in this chapter
for the validation of assessments.

The Standards for an Assessment. We might disagree with the way in 
which particular assessments are made, arguing, in terms of our previous 
examples, that there are better ways to construct course examinations, 
better ways in which to decide on the promotion of salesmen, and better 
ways to determine the effectiveness of baseball players. What evidence 
could we use in our arguments? The most pertinent point to make in 
criticizing an assessment is to question the extent to which the assessment 
represents important types of performance. 

The course examination might be criticized by examining the content 
of the questions and comparing these with the subject matter of the 
course. If the examination is in American history, it might be possible to 
point out that no questions are included on the effect of the mechanical 
revolution on American culture, or other omissions of this kind might be 
noted. Another point of criticism would be to argue that certain of the
examination questions concern issues which are less important than 
others that might be included. The standard that an assessment should 
meet is that of content representativeness for the achievement which it 
seeks to measure. 

Similar arguments could be made about the representativeness of the 
yardsticks used in assessing the salesmen and the baseball players. We 
might argue with the sales manager that there are other things to consider 
in addition to the number of sales made. It might be pointed out that it 
helps the company more when sales are made of particular new products, 
ones that the company wants to introduce to the public. Another argu- 
ment might be that some salesmen build more good will in their work, 
perhaps forsaking some sales to gain the long-range trust of the client. 
We might argue with the baseball manager that he fails to include some 
important types of performance in his assessments, that factors such as 
how well players stand up in “close” games, how many errors a player 
causes for the other team, and how much the player contributes to team
morale should also be considered. 

The Sampling of Content. Using the word “content” to stand broadly 
for all of the possible things that could be taken into consideration in 
making an assessment, it is apparent that the content of an assessment 
is only a sample of the total content which could be considered. It is a 
sample in the sense that the number of items or points of consideration 
in the assessment is very small in comparison with the many that are in- 
volved in most complex performance situations. The history examination 
conceivably could include every fact stated in the textbook, plus those 
given in the class lectures, plus the many questions that could be asked 
about the relationships between the facts. 

Although the course examination is a sample of the total course mate- 
rial, it is seldom a randomly chosen set of questions. If the purpose were 
to select randomly chosen bits of the things that met the student’s eye 
during the semester, it would be sensible to use questions like, “What 
is the middle initial of the author of the text?” “Where was the text 
printed?” “Whose picture is on page 75?” Of course, materials of this 
kind would not be likely to find their way into an examination, and the 
reason is that the sampling is restricted to items that are judged to be 
important by the instructor. We will talk at greater length about the
sampling of content when classroom examinations are considered in
Chapter 8.

Defining an Assessment. The sampling of content for an assessment in- 
evitably involves human judgment and human values. That is why the 
outline in Table 4-1 states that the construction of an assessment begins 
with a definition, a definition of what is desirable in a particular per- 
formance setting. Sometimes the valuing is done by one person only, and 
it can be done arbitrarily if he chooses. The business executive who takes 
the responsibility of hiring and firing, promoting and demoting, may use 
only his own impression of the individual’s achievement. In this case, 
the assessment is defined in terms of the personal likes and dislikes of 
one individual. 

In other situations, the assessment is defined cooperatively by a num- 
ber of persons. Some of the school achievement tests illustrate this. The 
test might be the joint product of, say, a number of history teachers, who 
try to come to an agreement about the important course material. The 
definition, or the statement of what is valued, might be explicitly set out 
in outline form and rules might be established as to how the examination 
questions should be presented. Regardless of how explicitly the assessment
is defined or the extent to which the assessment is a joint effort,
the assessment stands as a definition of what is “good” or what is valued
in a particular situation.

An assessment can be defined most explicitly if it is possible to state
the actions, or behaviors, which constitute successful performance. The
several illustrative assessment situations which we have spoken of so far
are defined in terms of actions. For the salesman, successful actions
consist of selling products. Success for the history student depends on
answering the examination questions correctly. The successful baseball
player performs the actions of making hits, stealing bases, and making
put-outs. When assessments are defined in terms of actions, it is relatively
easy for people to examine and discuss the assessment content.

In many assessment situations, the actions which make for successful
performance are difficult to state. What actions are involved in being 
a successful floorwalker in a department store, naval captain, or chef? 
Members of these professions are certainly assessed in terms of performance,
and some are given positions of more and some of less responsibility.
Occupations such as these are difficult to define in terms of concrete
actions, and there are often legitimate reasons for disagreeing about who
is successful and who is unsuccessful in each. It would be hard to defend 
the salesman who never sold an article; but even though you might find 
the products of a particular chef to be less than desirable, there may be 
others who relish his fare.

If the assessment cannot be explicitly defined in terms of actions, then
it must at least be definable in terms of persons, in terms of those who 
are relatively successful and unsuccessful. We would want to know 
whether the people who judge the effectiveness of naval captains agree 
about the assessment of each of the captains. If the assessment rests with 
a group of higher-ranking officers, each of them could be asked to rate 
or rank-order 50 naval captains as to their performance. If the separate 
ratings agree highly, if Captain Smith is rated high by all of the judges 
and Captain Green is rated as relatively ineffective in his duties, and so 
on for the captains in between, it can be said that there is a reliable 
assessment being made. We would then know that the same or very 
similar actions are probably being considered by each of the judges, even 
though the judges are not capable of stating what these actions are. 
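One way to put a number on the agreement between two judges is a rank-order correlation. The sketch below uses Spearman's coefficient for rankings without ties; that choice is an assumption made here, since the text does not name a particular statistic. The rankings are hypothetical, and six captains are used instead of 50 to keep the example short.

```python
def spearman_rho(ranks_a: list[int], ranks_b: list[int]) -> float:
    """Rank-order correlation between two complete rankings that contain no ties."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("both judges must rank the same two or more persons")
    sum_squared_differences = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_squared_differences / (n * (n * n - 1))


# Hypothetical rankings of six captains by two higher-ranking officers
# (1 = most effective); position i in each list refers to the same captain.
judge_1 = [1, 2, 3, 4, 5, 6]
judge_2 = [2, 1, 3, 4, 6, 5]

agreement = spearman_rho(judge_1, judge_2)
print(round(agreement, 2))  # 0.89
```

A coefficient near 1 would indicate that the judges order the captains in nearly the same way; a coefficient near zero would indicate that they are not considering the same things.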

Criterion Research. If the ordering of people on the assessment proves
to be reliable, there is something that the measurement specialist can do
to make the assessment more explicit. He can perform research to learn
some of the actions which constitute successful performance. The behavior
of 50 successful naval captains could be studied and compared
with that of 50 captains who had been judged to be relatively unsuccess- 
ful in their duties. Studies could be made of the way in which the success- 
ful captains organize their commands, the kinds of responsibility which 
they designate to particular officers, the systems of communications which 
they set in operation for handling emergencies, and so on for the many 
places in which successful officers might differentiate themselves from
unsuccessful officers.

The set of “critical behaviors” or “critical incidents” which are found
could be used in an assessment rating scale, like the following: 


                                                        Yes    No

Does the officer supervise the ship’s navigation
rather than leave it to other officers?                 ___    ___

Does the officer instruct his men about their
behavior when off the ship?                             ___    ___

Does the officer try to keep informed on new
technical developments?                                 ___    ___

Does the officer discipline his subordinates in the
company of crew members?                                ___    ___

Does the officer consider the grades made by the
men in specialized courses when recommending
promotions?                                             ___    ___


A longer list of actions or behaviors of this kind could be used by judges
in making assessments. Although such a list of critical behaviors would 
not be completely satisfactory to all of the judges nor would there be 
assurance that important actions were not being overlooked, the list 
would be at least a start in making the assessment more explicit. 


EVALUATING A PREDICTOR 


The score that an individual makes on a predictor test has no necessary 
importance in its own right. Consider the following type of test material. 


Circle every group of dots that has exactly four dots. 


Do as many as you can in ten seconds. 


Although scores on a test of this kind would have no obvious meaning, 
it can be shown that they would predict performance on certain clerical 
jobs. Employers would, of course, have little interest in the ability of 
prospective employees to circle groups of dots. Their purpose is to select 
applicants who will later prove to be good workers. The purpose of tests 
of this kind is to predict, or forecast, how well individuals will perform, 
or how highly they will be assessed, in important situations. The assessment
which a particular test is intended to predict is called the validation
criterion. 
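Table 4-1 names correlation with the criterion as the validation procedure for a predictor. As a rough sketch, the product-moment correlation between predictor scores and criterion ratings can be computed as below. The dot-test scores and supervisors' ratings are hypothetical figures invented for the example, and the function name is an assumption.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Product-moment correlation between paired predictor and criterion scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two or more paired scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data: dot-circling scores at hiring, and later ratings of
# clerical performance (the validation criterion) for the same six workers.
dot_test_scores = [12, 15, 9, 20, 17, 11]
job_ratings = [3, 4, 2, 5, 4, 3]

validity = pearson_r(dot_test_scores, job_ratings)
print(round(validity, 2))  # 0.98
```

Real predictors seldom correlate this highly with their criteria; the invented figures were chosen only to keep the arithmetic easy to follow.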

It should be understood that the term “prediction”² is being used in
the broad sense to encompass functional relationships of all kinds. A test
could conceivably be used to “predict” what people have done in the past,
in the same way that the detective gathers clues to “predict” who
committed a crime. Also, tests could be used to “predict” how people behaved
when they were children. Tests are often used to “predict” a current
condition rather than to predict in the stricter sense, or to forecast future 
behavior. A typical example is a test for brain damage. The purpose is 
to “predict” the present condition of brain damage, and the proof of the 
“prediction” would be borne out by detailed physiological investigations. 


The Statistics of Prediction. No test is a perfect predictor and, as we 
shall see later, nearly all predictors entail more error than accuracy in 
forecasting performance. The relative accuracy of predictor tests is 
determined from statistical analysis: statistical descriptions of the num- 
ber of people who are properly and improperly classified, statistical esti- 
mates of the per cent of persons who can be safely forecast at particular 
levels on the assessment, and statistical estimates of the amounts of 
money that can be saved by using a predictor. Because of the need for 
statistical measures in determining the utility of predictor tests, the next 
several chapters will concern the statistical models which are most often 
used in testing.
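The simplest of these descriptions, the number of people who are properly and improperly classified, can be sketched as follows. The scores, the record of later success, and the cutting score of 13 are all hypothetical, and the function name is an assumption made for the example.

```python
def classification_table(scores: list[int], succeeded: list[bool], cutoff: int) -> dict[str, int]:
    """Count hits and misses when applicants scoring at or above `cutoff` are accepted."""
    counts = {
        "accepted_success": 0,  # properly classified
        "accepted_failure": 0,  # improperly classified
        "rejected_success": 0,  # improperly classified
        "rejected_failure": 0,  # properly classified
    }
    for score, did_succeed in zip(scores, succeeded):
        decision = "accepted" if score >= cutoff else "rejected"
        outcome = "success" if did_succeed else "failure"
        counts[f"{decision}_{outcome}"] += 1
    return counts


# Hypothetical predictor scores and later success on the job for eight persons.
predictor_scores = [12, 15, 9, 20, 17, 11, 14, 8]
later_success = [False, True, False, True, True, True, False, False]

table = classification_table(predictor_scores, later_success, cutoff=13)
properly_classified = table["accepted_success"] + table["rejected_failure"]
print(properly_classified / len(predictor_scores))  # 0.75
```

In this invented case the predictor classifies six of the eight persons properly; moving the cutting score would change the balance between the two kinds of error.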

Construction of a Predictor. A predictor test begins with a hypothesis, 
a hypothesis that test materials of a particular kind will predict a particular 
assessment. We might hypothesize that numerical problems will predict 
how well individuals will perform in mechanical engineering school. The 
task would then be to assemble a large number of numerical prob- 
lems such as: 


(a) 268 _        (b) 14 × 64 = ______        (c)   95,842
                                                  −25,983


A large collection of items of this kind would be called an item pool. The 
problem is then to select from the item pool the minimum number of 
items which afford the best prediction of an assessment. This is done by 
statistical procedures of item analysis, which will be discussed in Chapter 


Examples of Predictors. Most of the measures which psychologists
construct are predictors. There are very few places in which the psychologist
has the responsibility of deciding what is good and bad in particular
performances, the exception being when the psychologist gives grades
to his own students. All the aptitude tests are predictors; they are meant
to forecast how well people will perform as salesmen, students, naval
captains, artists, etc. All personality tests are predictors. They seek
to predict the likelihood that a person will have a “nervous breakdown,”
enter a mental hospital, or how he will react in emergencies, how
effective he will be in positions of leadership, and how successful he is
likely to be in marriage.

² Different terms are often used to refer to functional relationships with past,
current, and future events. Functional relationships with past events are more properly
termed “postdiction.” Functional relationships with current statuses are often called
“diagnosis.” Whereas the proper meaning of the word predict is to forecast future
behavior, it will be used here to refer to all three kinds of functional relations.

The so-called intelligence tests were initially instruments of the pre- 
dictor type. Although some have tried to argue that these tests are 
measures, i.e., assessments, of intelligence, the domain of all possible 
actions that could be called intelligent behavior is not agreed on by 
different investigators, and the domain is too large to permit an adequate 
sampling. 

Whether an instrument is classified as an assessment or as a predictor 
often depends on the way in which it is used. On a test of attitudes 
toward Eskimos, for example, an individual answers “yes” to questions 
such as, “Do you think that Eskimos are as intelligent as other people?”
“Would you be willing to have an Eskimo as a personal friend?” “Would 
you be willing to make a donation for helping Eskimos?” It could then be 
said that the individual responds as though he is favorable to Eskimos. 
Consequently, the instrument could be considered an assessment of 
verbal report. However, there is no guarantee that the individual would 
actually behave in a friendly manner toward Eskimos, that he would 
treat them as intelligent persons, make friends with them, or donate 
money for their aid. Therefore, the instrument is being used as a predictor 
of how people will behave.

An interest test can demonstrate what an individual says that he likes 
to do, but the verbal report alone gives no assurance that the individual 
will actually seek out the things which he professes to enjoy. The in- 
dividual who says that he likes classical music may never attend a concert 
and may habitually switch the radio from a presentation of Beethoven to 
other forms of musical fare. Similarly with the responses given in an 
opinion survey, it is not necessarily so that an individual will behave as 
though he holds a stated conviction. 

When an instrument tries to determine only an individual’s verbal re- 
sponse, it can be said that an assessment is being made. Any effort to use 
the response as an indication of how an individual behaves in other
situations or how he will react in the future means that the instrument is
being used as a predictor.

Numerous studies have found relationships between what people say
and what they do later, but the important point here is that it is necessary
to prove such relationships rather than take them for granted. Also, what
people say they like, think, and feel is an important type of information 
regardless of whether it predicts or is used to predict subsequent actions. 


When Not to Construct Predictor Tests. Predictor tests have proved 
themselves useful in countless practical situations. They have been par- 
ticularly useful in the selection and assignment of people in schools, in 
the Armed Forces, in government agencies, and more recently in industry.
Care must be taken that the demand for new tests does not bring forth
poorly standardized and improperly evaluated instruments.

The test constructor must be cautious at the outset not to launch a 
program of research which has little chance of ending in success. Before 
the construction of predictor tests is undertaken, the nature of the assess- 
ment, or criterion, should be examined. It is most fortunate if the assess- 
ment is defined in terms of easily observed actions. If the assessment 
cannot be specified in terms of actions, it is essential that the assessment 
at least be expressed in terms of the persons who stand high and who 
stand low in the particular performance. It is very important in this 
latter case to make sure that the ratings of persons are reliable. If neither 
the actions can be specified nor the people be reliably rated, it is a waste 
of time to work on the construction of predictor tests. There will be no 
way to evaluate the tests which are constructed. In some situations, the 
best service that the test constructor can perform is to tell the people 
who want predictor tests that it would be futile to begin such work
until more systematic assessments are developed. 


CONSTRUCT VALIDITY 


The distinction between predictors and assessments will serve to judge 
many, if not most, of the instruments in psychology. If a measure can 
be unambiguously categorized as either a predictor or an assessment, the 
rules for constructing and validating the measure are reasonably clear. 
There is a third category of measures that does not fit completely into 
either of the two basic divisions. Measures of this kind are encountered 
most often in psychological experiments. In a typical experiment, the 
purpose is to test the hypothesis that difficult learning tasks raise anxiety. 
In order to test the hypothesis, it is necessary to measure anxiety. The 
measure of anxiety cannot be regarded as a predictor, because the pur- 
pose is not to predict some other action but to measure anxiety then and 
there. Neither can the measure be an assessment, because there is no 
agreed-on set of “anxiety actions,” or content, to be sampled. 

Construct validation consists of defining a measure in terms of numer- 
ous research findings. This is essentially the way in which intelligence 
tests have gained meaning. Enough research has been done with these 
instruments to know how the underlying function grows with the child
and how intelligence test scores relate to numerous other variables. This
gradual defining of an instrument in terms of what it does is the major
approach to validating some psychological measures. It is neither a sure
nor an easy course, and psychologists are not entirely in agreement as to
what constitutes construct validity. There is also a danger that instruments
which should be considered as either predictors or assessments and
validated as such will hide under the cloak of construct validity. Con- 
struct validity should be considered a last resort, when the instrument is 
not specifically intended to predict performance or when the instrument 
cannot stand alone as a self-evident measure of what is intended. How- 
ever, there are some instruments which can be judged only by accumu- 
lated research experience. 


OTHER LOGICAL CONCEPTS 


The logic which has been presented here for the evaluation of psycho- 
logical measures is not one that would be agreed on by everyone in the 
field of testing. There are numerous other logical classifications which 
have been used in explaining the usefulness of measures. Some of these 
constructs will be listed and compared with the logical scheme de- 
scribed above. 

Face Validity. Face validity refers to whether or not the test con- 
tent “looks” as if it measures what is meant to be measured. This is 
certainly one of the important properties of an assessment, such as a 
school examination; but this standard of validity has often been ap- 
plied to predictor tests. Personality inventories have sometimes been 
claimed to be valid because they have face validity. This is a poor 
way to evaluate a predictor. Indeed, with personality tests, the more 
the test items bear an obvious relationship to what is being predicted, 
the more easily the test can be “faked” by the subjects. 

Predictor tests should not be judged in advance because they do or do 
not have face validity. Sometimes the most unlikely looking test will 
prove to be an adequate predictor. We said earlier that a predictor test 
represents a hypothesis. One of the prime rules in science is never to 
exclude a hypothesis before it has been tested. There are numerous cases 
in which scientific progress was held up because of a refusal on the part 
of some to let a hypothesis be tested. Some of these were the germ theory 
of disease, the heliocentric theory of the solar system, and the law of 
falling bodies. Predictor tests can only be judged empirically, by the 
strength of their relationships with assessments. 

Correlations between Predictors. A not uncommon practice has been 
to correlate a new predictor test with a better-established predictor test. 
If the correlation is high, this is taken as evidence that the new test is 
“valid.” The difficulty with this procedure is that a new test can correlate 
highly with an older test but not correlate at all with the assessment it 
is meant to predict. For example, the older test might correlate .60 with 
a particular assessment. The new test might be found to correlate as high 
as .70 with the older test. It is possible in this case for the new test to 
correlate zero with the assessment. It is always much more to the point 
to correlate a predictor directly with the thing it is meant to forecast. 
There are only two cases in which correlating one test with another 
will provide definite information. If the correlation between the two tests 
is nearly perfect, close to .90, then the two tests are almost identical and 
should be approximately equal in predictive effectiveness for any assessment. 
At the other extreme, if the correlation between the two tests is 
very low, approaching zero correlation, it is certain that the two tests 
are measuring different things. 
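
The claim that a new test can correlate .70 with an older test and still correlate zero with the assessment can be checked numerically. The sketch below is an illustration only: it assumes the standard requirement that three correlations must form a consistent (positive semidefinite) correlation matrix, and the function name is hypothetical.

```python
import math

def third_correlation_bounds(r_ab, r_bc):
    """Range of r_ac that is consistent with given r_ab and r_bc.

    Follows from requiring the 3 x 3 correlation matrix of a, b, c
    to have a non-negative determinant.
    """
    centre = r_ab * r_bc
    half_width = math.sqrt((1 - r_ab ** 2) * (1 - r_bc ** 2))
    return centre - half_width, centre + half_width

# a = assessment, b = older test, c = new test (values from the example)
low, high = third_correlation_bounds(0.60, 0.70)
print(f"new test vs. assessment may lie between {low:.2f} and {high:.2f}")
```

Because the lower bound is slightly below zero, a zero correlation between the new test and the assessment is one of the admissible outcomes.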

We can get some information from correlating one assessment with 
another of its kind. For example, we might try out two achievement 
tests which are both intended to assess the student’s knowledge of history. 
If scores on the two tests do not correspond, if the correlation is not high, 
we would wonder about the differences in content which made one 
measure different from the other. This might lead to a better appraisal 
of the subject matter, a more explicit outline of the major topics, and 
eventually to a more comprehensive achievement test. However, the 
observation that two assessments correlate is not necessarily an indication 
that the assessments are “good.” For example, one instructor could meas- 
ure course achievement in history by seeing how far the students could 
throw the textbook. Another instructor could use as a test how far the 
students could throw their notebooks. The two measures would correlate 
highly, but neither would have anything to do with history. 

Factorial Validity. A primary concern of measurement specialists is 
to find tests that measure some common function and thus define what 
is called a factor. Factor analysis is the statistical procedure for isolating 
these common functions and will be discussed in Chapter 9. It has been 
found, for example, that tests like reading comprehension, vocabulary, 
and word problems have in common a factor of verbal comprehension. 
Other groups of tests define reasoning, memory, and perceptual factors. 
Factor-analysis studies help us to understand the nature of human attri- 
butes and provide a very useful classification scheme for available tests. 

If a test is described as a measure of verbal comprehension, some proof 
should be offered that the classification is correct. The proof consists of 
correlating the test with the factor which it is intended to measure. It is 
sometimes found in this way that a test which is intended to measure a 
particular factor is in fact more related to some other factor. Although it 
is quite useful to learn the factor composition of particular tests, the 
term “factorial validity” should be accepted with some caution. Even 
if a test has high factorial validity, it may be invalid as a predictor of 
particular criteria. Factorial validity indicates only that a test is properly 
classified but not necessarily that it is good or useful for any purpose. 


RELATIONSHIPS AMONG VALIDITY CONCEPTS 


Although most measurement specialists recognize predictive validity 
and content validity, there is some controversy as to how these concepts 
apply to currently available tests. For example, there are some who claim 
that intelligence tests should be judged by content representativeness, 
or content validity (see 4, chap. 7). The point of view given here is that 
intelligence tests must be validated by standards of predictive and con- 
struct validity rather than by standards of content sampling. Whereas the 
point advocated here is that most tests of personality should be regarded 
as predictor tests and validated as such, there are those who argue for 
face validity. (See 1, 2, 3, 4, 5 for alternative points of view about test 
validation.) 

Some measurement specialists regard factor analysis as an end in itself 
and factorial validity as the key standard for test validation. The point 
of view advocated here is that factor analysis is very useful in classifying 
and exploring the nature of tests, but that factor analysis should be considered 
only the first step in determining the practical usefulness of 
particular instruments. 

Even if a test can be classified clearly as a predictor or as an assess- 
ment, this does not mean that the standards and procedures that apply 
to other kinds of measures are not useful. For example, some of the 
validation procedures that apply to predictor tests are useful with assess- 
ments, such as with achievement tests. Although achievement tests, and 
all assessments, must ultimately be judged by standards of content repre- 
sentativeness, it is helpful to learn whether the tests correlate well with 
other achievement tests and whether the tests predict later achievement. 

Factor analysis is useful in exploring the content of achievement test 
batteries. For example, a factor-analysis study of achievement tests might 
show that a test which is intended to measure general knowledge of the 
social sciences is overly slanted toward economics or toward some other 
special field. This information would help the investigators to revise the 
test in such a way as to include a broader coverage of social science 
material. 

Whereas a predictor test must ultimately be judged by its ability to predict 
(in the broad sense of the word) important behavior outside the testing 
situation, other standards are useful. Efforts to understand the 
content of predictor tests and to hypothesize what the tests measure 
are natural ways of exploring the nature of human abilities. As 
was said previously, factor analysis plays an important part in helping us 
understand the nature of human abilities as it is reflected in predictor-type 
tests. Also, it is often helpful to correlate one predictor test with a 
number of others in order to get some hunches regarding practical uses 
for an instrument. 

Construct validation requires not only some of the procedures used 
with assessments and predictors but also some other procedures as well. 
The major approach is to determine whether the experimental evidence 
obtained from applying an instrument fits the “construct,” or the theory, 
for which the instrument was designed. For example, in the construct 
validation of an instrument to measure “anxiety,” it is important to deter- 
mine whether or not the instrument gives results that are in compliance 
with a theory about anxiety. If the theory holds that anxiety increases in 
particular kinds of frustrating situations, an experiment can be under- 
taken to determine whether people who are placed in the situation obtain 
higher anxiety scores than people who are not placed in the situation. 
Many such experiments will help to determine whether an instrument 
measures a particular construct, such as the construct of anxiety in the 
example. 

It is primarily important to keep in mind that an instrument is validated 
for particular uses rather than in a general sense. For example, the test 
which predicts one job performance well is not necessarily a good pre- 
dictor in general. In fact, it may have no validity at all for predicting 
other kinds of vocational accomplishment. A mathematics examination 
may have good content representativeness for a particular unit of instruc- 
tion, but the test will not necessarily be good for other uses. If, as in the 
example above, a test proves to have construct validity for the measure- 
ment of anxiety, it is not necessarily the case that the test will be pre- 
dictive of marriage happiness or any other set of behaviors that might be 
predicted. The proper aim, therefore, is to try to determine how well an 
instrument serves some function, or set of functions, rather than to try to 
prove that an instrument is generally “good.” 


SUMMARY 


There is still some controversy as to how psychological measurements 
should be validated. Although they may not use this particular nomencla- 
ture, most psychologists recognize the categories of predictors and 
“assessments.” If an instrument is used as either a predictor or an assess- 
ment, the rules for validation are relatively clear. Primarily, a predictor 
must demonstrate functional relations with important behaviors outside 
the testing situation, and an assessment must rest on sound methods of 
content sampling. In addition to the primary validation procedures for 
predictors and assessments, some auxiliary empirical and statistical pro- 
cedures are useful. 

Some instruments can neither be classified entirely as predictors nor as 
assessments. These types of instruments are often used in psychological 
experiments. If the instruments measure what they are purported to 
measure, they should demonstrate construct validity. There is consid- 
erable controversy about construct validity—whether it is a meaningful 
concept, and if so, how it is to be studied. Construct validation is essen- 
tially a “bootstrap” operation in which an instrument that initially has 
little meaning in its own right gains meaning through continued use and 
research. 


CHAPTER 5 


Predictive Efficiency of Tests 


Predictor tests are used to improve the “bets” that are made about how 
individuals will perform in particular situations. As background for under- 
standing how tests are used, let us first look at a situation in which there 
is no test available for making predictions. Imagine that we are trying to 
predict the grade averages of graduating college students. The grade 
averages are at hand, the mean and standard deviation are known, but 
there is no indication as to which student made which grade average. 
Also, we do not know any of the students personally and have no information 
about their individual capabilities. Imagine then that someone 
calls off the name of each student in turn and you are required to guess 
the grade average of each. With this complete lack of information about 
the ability of the individual students, what would you guess for Bill Smith, 
Mary Jones, Tom Wilson, and each student as his name is read off? 

Would the predictions be most accurate by betting that the whole 
group made low scores, that some made high and others low scores, or 
should the bets be made on some random basis? You would lose by 
choosing any of these courses. The proper course would be to bet that all 
of the students made exactly the same grade, the mean. The mean is the 
point about which the sum of squared deviations is a minimum; and 
without a knowledge of the ability of individual pupils, the over-all 
amount of error will be least by predicting that all students make average 
grades. 
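
The least-squares property of the mean can be seen with a few numbers. The sketch below uses made-up grade averages (they are not from the text) and compares one bet at the mean with bets placed a little below and a little above it.

```python
from statistics import mean

def sum_squared_error(scores, guess):
    """Total squared error when the same guess is made for every score."""
    return sum((s - guess) ** 2 for s in scores)

# Hypothetical grade averages, made up for illustration only.
grades = [1.8, 2.4, 2.5, 3.0, 3.1, 3.6, 3.9]
m = mean(grades)

# Compare betting at the mean with betting below or above it.
for guess in (m - 0.5, m, m + 0.5):
    error = sum_squared_error(grades, guess)
    print(f"guess {guess:.2f}: sum of squared errors {error:.3f}")
```

Whatever set of grades is substituted, the middle line (the bet at the mean) shows the smallest sum of squared errors.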

Why should you make a series of bets that are sure to be largely in 
error? Only a minority of the grades will be exactly at the mean, and the 
standard deviation may indicate that some of the grades vary considerably 
from that point. The reason is that, in spite of some large errors in pre- 
diction that are bound to occur, the over-all amount of error in predic- 
tion would be even larger if any system of betting were employed other 
than betting at the mean for all students. We know this is so because of 
mathematical proofs, and we must learn to accept mathematical proofs 
even if they do not correspond with common-sense expectations. Procedures 
of statistical estimation are always on the conservative side, allowing 
bets to venture away from the mean as more and more information is 
obtained about particular persons. Only in the case of a perfect predictor 
test would the bets vary as widely as the actual assessment scores. 

Statistical Methods in Scientific Activity. Perhaps it would be wise to 
consider for a moment why there is need for statistical procedures in sci- 
entific enterprise. In classical mathematics, the kind that you encoun- 
tered in introductory algebra, simultaneous equations like the following 
are often found: 


        X     Y 
a.      2     1 
b.      4     2 
c.      8     4 
d.     10     5 


Each pair of numbers above forms an equation. Equation a says that 
when X is 2, Y is 1, or, in other words, that X is twice as large as Y. Equa- 
tion b says that when X is 4, Y is 2, which shows again that X is twice as 
large as Y. Equations c and d also affirm that X is twice as large as Y. 
The same relationship would be found by adding the four equations 
together. When a number of equations all supply the same information, 
they are said to be equivalent. 

One of the exercises in elementary algebra is to make a graphic plot of 
equations like those above. This is done in Figure 5-1. Any set of two-variable 
equations, called a bivariate relationship, can be plotted as in 
Figure 5-1 by locating the point for each equation in terms of the 
values on the X and Y axes. In Figure 5-1 all of the points lie on a 
straight line. 

Figure 5-1. Graphic demonstration of the relationship between variables. 

Now let us consider some equations that concern real-life things rather 
than abstract mathematical relations. Turning back to the earlier example 
of the prediction of college grade averages, imagine that a predictor test 
is in use. Let us say in this case that a vocabulary test was administered 
to incoming freshmen and that the scores are to be compared with grade 
averages when the students reach graduation. 

In order to find out how well the predictor test works, we would want 
to compare vocabulary scores with grade averages. Comparisons could be 
made of the raw scores on the two variables, but, because the size of raw 
scores is dependent on the method of test construction used, this would 
cause some confusion. The important consideration is whether the students 
who make high vocabulary scores also make high grade averages 
and whether the students who make low vocabulary scores also make low 
grade averages. Consequently, it is most meaningful to express both 
vocabulary and grade average in standard score units. To simplify the 
comparison, we will deal with only nine students. Means and standard 
deviations are obtained on vocabulary and grade averages for the nine 
students only. Standard scores are determined for the nine students on 
vocabulary and grade average, with the following results: 

Student    Vocabulary (z_x)    Grade average (z_y) 
a.               1.55                 1.18 
b.               1.16                 1.77 
c.                .77                  .59 
d.                .39                −1.18 
e.                .00                  .59 
f.               −.39                 −.59 
g.               −.77                 −.59 
h.              −1.16                 −.59 
i.              −1.55                −1.18 


Just as in the earlier example we considered each pair of numbers as 
constituting an equation, each pair of standard scores above also forms an 
equation. Person a has a vocabulary standard score (z_x) of 1.55 and a 
grade standard score (z_y) of 1.18. Then, by dividing 1.55 into 1.18, a ratio 
of .76 is found. In the previous example, only one equation was necessary 
to determine the over-all relationship between two variables. If that were 
so in the present example, we could say that each person’s grade standard 
score is exactly .76 times as large as his vocabulary standard score. Let us 
go on to see what relationships the other equations specify. The equation 
for person b indicates that the ratio is 1.53, c indicates .77, and d indicates 
−3.03. It is obvious that these equations provide very different 
indications of the over-all relationship between vocabulary (z_x) and 
grades (z_y). When, as in the example here, simultaneous equations do 
not agree with one another, they are said to be nonequivalent. 
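
The disagreement among the nine equations can be tabulated directly. The sketch below is illustrative: it takes the standard scores from the table of nine students, and it assumes that student h's grade standard score is −.59, the value required for the grade standard scores to sum to zero.

```python
# (vocabulary z, grade z) for students a through i.
# Student h's grade z is assumed to be -.59 so that the column sums to zero.
pairs = {
    "a": (1.55, 1.18), "b": (1.16, 1.77), "c": (0.77, 0.59),
    "d": (0.39, -1.18), "e": (0.00, 0.59), "f": (-0.39, -0.59),
    "g": (-0.77, -0.59), "h": (-1.16, -0.59), "i": (-1.55, -1.18),
}

def ratio(z_x, z_y):
    """Grade z divided by vocabulary z; None when vocabulary z is zero."""
    return None if z_x == 0 else z_y / z_x

ratios = {name: ratio(z_x, z_y) for name, (z_x, z_y) in pairs.items()}
for name, value in ratios.items():
    print(name, "undefined" if value is None else f"{value:.2f}")
```

No two students need agree on the ratio, and for student e, whose vocabulary standard score is zero, the ratio cannot be formed at all.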

Correlation. Nonequivalent relations are the typical and not the unusual 
scientific finding. Whether the subject matter is physics, chemistry, biology, 
or psychology, the numerous observations tell different stories about 
the form of a particular relationship. In some investigations, where the 
variables are simple and easily controlled, the divergence from equiva- 
lence is very small. In more complex problems, such as predicting how 
individuals will perform on a particular job, the divergence from equiva- 
lence is usually considerable. Statistical methods were developed as a 
means of dealing with nonequivalent equations. 

A graphic comparison is made of vocabulary and grade standard scores 
in Figure 5-2 (the “best-fit” line in the figure will be explained shortly). 
It is obvious that the points do not lie on a straight line, but there is a 
tendency for the two variables to go together. How would you summarize 
the relationship? We might use only our impression and say that there is 
a “strong correspondence” or a “moderate correspondence.” A better ap- 
proach is to use a statistical statement of the degree of relationship 
between vocabulary and grade averages. 


Figure 5-2. Comparison of standard scores on a vocabulary test with grade point 
averages expressed as standard scores. 

There are a number of ways in which statistics can be developed for 
the problem here. One that might seem reasonable at first thought is to 
add together all of the vocabulary standard scores and divide this sum 
into the sum of the grade standard scores. This method worked with the 
equations in the earlier example, but it will not do here. As you will 
remember from Chapter 3, the sum of standard scores on any variable is 
zero. Consequently, we would get zero equals zero as a summarizing 
equation, which is certainly true, but does not provide the needed 
statistic. 

Previously it was shown that working with squared scores and squared 
deviations leads to some very useful results. We will appeal to that prin- 
ciple again by looking for a summary equation which leads to the smallest 
sum of squared errors in prediction. That is, we will seek a ratio between 
vocabulary and grade standard scores, such that the sum of the squared 
errors in predicting grades will be at a minimum. This means that we are 
seeking an equation like the following: 


$$z_y' = b z_x$$

such that 

$$\sum (z_y - z_y')^2 = \text{a minimum}$$

where b is a constant 
z_y is a grade standard score 
z_x is a vocabulary standard score 
z_y' is an estimate of a grade standard score 


The equation above is referred to as a linear equation: no matter what 
value we choose for b, it results in a straight line of relationship. Of the 
infinite number of values that could be chosen for b, there is only one that 
satisfies the “least squares” criterion. The method for determining the 
correct b value is derived quite easily from elementary calculus. The 
method which is derived in this way can be applied to the present prob- 
lem. The first step is to multiply each vocabulary standard score by its 
corresponding grade standard score, as follows: 


    z_x        z_y      z_x z_y 
   1.55  ×    1.18  =    1.83 
   1.16  ×    1.77  =    2.05 
    .77  ×     .59  =     .45 
    .39  ×   −1.18  =    −.46 
    .00  ×     .59  =     .00 
   −.39  ×    −.59  =     .23 
   −.77  ×    −.59  =     .45 
  −1.16  ×    −.59  =     .68 
  −1.55  ×   −1.18  =    1.83 

$$\sum z_x z_y = 7.06$$

$$b = \frac{\sum z_x z_y}{N} = \frac{7.06}{9} = .78$$

The final steps are to sum the standard score cross products (Σ z_x z_y) 
and divide the result by the number of persons (N). This is all that is 
necessary to obtain b, the value which satisfies the “least squares” criterion. 
In this case b is .78, which means that the best (least squares) 
estimate of each grade standard score is obtained as follows: 


$$z_y' = .78 z_x$$

Then the estimated grade standard score for person a is 

$$z_y' = .78 \times 1.55 = 1.21$$

Person a actually has a grade standard score of 1.18; consequently, the 
estimate is in error by .03 score units. Although there will be errors in 
prediction by using b to summarize the relationship, the sum of the 
squared errors will be less than that obtained from any other linear 
equation. 
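
The calculus step that yields b can be sketched briefly. This is a reconstruction rather than the author's own derivation, and it assumes that the standard scores were formed with N in the denominator of the standard deviation, so that the sum of squared z_x values equals N.

```latex
S(b) = \sum (z_y - b z_x)^2
\qquad
\frac{dS}{db} = -2 \sum z_x (z_y - b z_x) = 0
\;\Longrightarrow\;
b = \frac{\sum z_x z_y}{\sum z_x^2} = \frac{\sum z_x z_y}{N}
```

The second derivative, 2 Σ z_x², is positive, so this value of b gives a minimum rather than a maximum of the sum of squared errors.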

Because the prediction equation above is a linear formula, b can be 
used to draw a “best-fit” line as is done in Figure 5-2. For every unit the 
line moves over the vocabulary axis, it moves up .78 units on the grade 
average axis. Those who have studied trigonometry will recognize b as 
the tangent of the angle formed by the best-fit line and the vocabulary 
axis. 

When the constant term, b, is sought by the “least squares” criterion with 
respect to standard scores, it is given a special name: the correlation 
coefficient. Also, when dealing with standard scores, the symbol r is usually 
employed instead of b. Then we can rewrite the best-fit equation 
as follows: 

$$z_y' = r z_x$$

To obtain the best estimate of the grade standard scores, we multiply each 
vocabulary standard score by the correlation coefficient. 
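
The hand computation of the correlation coefficient for the nine students can be repeated in a few lines. The sketch below is illustrative: it rounds each cross product to two places, as the hand computation does, and it assumes student h's grade standard score is −.59. The helper name is hypothetical.

```python
# Standard scores for students a through i (h's grade z assumed to be -.59).
z_x = [1.55, 1.16, 0.77, 0.39, 0.00, -0.39, -0.77, -1.16, -1.55]
z_y = [1.18, 1.77, 0.59, -1.18, 0.59, -0.59, -0.59, -0.59, -1.18]

# Cross products, rounded to two places as in the hand computation.
products = [round(x * y, 2) for x, y in zip(z_x, z_y)]
r = sum(products) / len(products)

def squared_error(b):
    """Sum of squared errors when grade z is predicted as b times vocabulary z."""
    return sum((y - b * x) ** 2 for x, y in zip(z_x, z_y))

print(f"sum of cross products = {sum(products):.2f}, r = {r:.2f}")
for b in (r - 0.2, r, r + 0.2):
    print(f"b = {b:.2f}: sum of squared errors = {squared_error(b):.3f}")
```

Moving b away from r in either direction raises the sum of squared errors, which is what the least squares criterion requires.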

When the constant term is found for standard scores, it not only serves 
to place the best-fit line, it has descriptive value of its own. The correlation 
coefficient, r, varies in different problems from +1.00 through zero to 
−1.00. If the correlation is +1.00, it means that there is a perfect correlation. 
Then the prediction equation becomes 

$$z_y' = z_x$$

and the predicted standard scores will be exactly the same as the actual 
grade standard scores. As the correlation coefficient goes downward from 
+1.00, the predictions become poorer and poorer. When the correlation 
reaches zero, the prediction equation is as follows: 

$$z_y' = r z_x = 0 z_x = 0$$

Because the mean of a set of standard scores is zero, a zero correlation 
leads to the prediction that all scores fall at the mean of the assessment. 
A negative correlation means that high scores on one variable go with low 
scores on the other variable and vice versa. 


Although the formula for the correlation coefficient, r, above is relatively 
easy to understand, it is not the easiest approach to computing the 
correlation. If scores are standardized over several hundred persons, 
rather than over the nine cases employed above, it results in considerable 
work. Therefore, it is easier to compute the correlation coefficient from 
deviation scores or from raw scores. Some manipulations of the basic 
formula will show how this is done. 


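$$
\begin{aligned}
r &= \frac{\sum z_x z_y}{N} && \text{(5-1)} \\
  &= \frac{\sum \left(\dfrac{x}{\sigma_x}\right)\left(\dfrac{y}{\sigma_y}\right)}{N} \\
  &= \frac{\sum xy}{N \sigma_x \sigma_y} && \text{(5-2)}
\end{aligned}
$$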


Formula (5-2) can be used to compute the correlation coefficient from 
deviation scores. Next we can replace each x value in Eq. (5-2) by 
X − M_x and each y value by Y − M_y and obtain the following raw score 
formula: 

$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} \qquad \text{(5-3)}$$

It must be remembered that the correlation coefficient specifies the relationship 
between two sets of standard scores. Either the computations 
begin with standard scores, as in Eq. (5-1), or the standardizing is done 
in the computations, as in Eqs. (5-2) and (5-3). The three formulas 
above will give exactly the same correlation coefficient. Although Eq. 
(5-3) looks complicated, it is actually the easiest way to obtain the correlation 
coefficient if an automatic calculator is available. 
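
That Eqs. (5-1), (5-2), and (5-3) agree can be checked on a small data set. The raw scores below are hypothetical: they are not given in the text, and were chosen so that standardizing them reproduces, to two decimals, the standard scores of the nine students (with student h's grade standard score taken as −.59). The function names are likewise illustrative.

```python
import math

def _mean_sd(values):
    """Mean and standard deviation, with N in the denominator."""
    n = len(values)
    m = sum(values) / n
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / n)

def r_standard(X, Y):
    """Eq. (5-1): mean cross product of standard scores."""
    (mx, sx), (my, sy) = _mean_sd(X), _mean_sd(Y)
    z_x = [(x - mx) / sx for x in X]
    z_y = [(y - my) / sy for y in Y]
    return sum(a * b for a, b in zip(z_x, z_y)) / len(X)

def r_deviation(X, Y):
    """Eq. (5-2): sum of deviation cross products over N * sd_x * sd_y."""
    (mx, sx), (my, sy) = _mean_sd(X), _mean_sd(Y)
    cross = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    return cross / (len(X) * sx * sy)

def r_raw(X, Y):
    """Eq. (5-3): raw score formula."""
    n = len(X)
    numerator = n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)
    left = math.sqrt(n * sum(x * x for x in X) - sum(X) ** 2)
    right = math.sqrt(n * sum(y * y for y in Y) - sum(Y) ** 2)
    return numerator / (left * right)

# Hypothetical raw scores for students a through i.
vocabulary = [9, 8, 7, 6, 5, 4, 3, 2, 1]
grades = [7, 8, 6, 3, 6, 4, 4, 4, 3]

for formula in (r_standard, r_deviation, r_raw):
    print(formula.__name__, f"{formula(vocabulary, grades):.4f}")
```

The three printed values coincide, apart from possible floating-point differences in the last digits.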

It is not only easier to compute the correlation coefficient from devia- 
tion scores and raw scores, it is often convenient to form the prediction 
equation from deviation scores and raw scores. 

The standard-score formula 

$$z_y' = r z_x \qquad \text{(5-4)}$$

can be converted to the deviation score formula 

$$y' = r \frac{\sigma_y}{\sigma_x} x \qquad \text{(5-5)}$$

where y' is the estimated deviation score on the assessment (grades) 
x is the actual deviation score on the predictor (vocabulary) 
r is the correlation between predictor and assessment 
σ_y is the standard deviation of the assessment 
σ_x is the standard deviation of the predictor 

Formula (5-5) could be used to predict students’ deviation scores in college 
grades from their deviation scores on the vocabulary test. Thus, if a 
student has a vocabulary deviation score of 3 and the correlation is .50, 
and the standard deviations for vocabulary and grades are 2 and 1 respectively, 
the prediction is as follows: 

$$y' = .50 \times \frac{1}{2} \times 3 = .25 \times 3 = .75$$

The prediction is that he will score .75 deviation score units above the 
mean of grade averages. Let us say that he actually makes a deviation 
score of 1.25 in grades, and therefore, the prediction is in error by .50 
deviation score units. 
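
The worked example for Eq. (5-5) translates directly into a small function. This is a minimal sketch; the function and argument names are illustrative and not part of the text.

```python
def predict_deviation(x, r, sd_x, sd_y):
    """Eq. (5-5): estimated deviation score on the assessment."""
    return r * (sd_y / sd_x) * x

# Values from the worked example: x = 3, r = .50, sd_x = 2, sd_y = 1.
predicted = predict_deviation(x=3, r=0.50, sd_x=2, sd_y=1)
error = 1.25 - predicted  # the student actually scored 1.25

print(f"predicted deviation score: {predicted}")
print(f"error of prediction: {error}")
```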


The prediction equation can also be expressed in raw score terms: 

$$Y' = r \frac{\sigma_y}{\sigma_x}(X - M_x) + M_y \qquad \text{(5-6)}$$

where Y' is an estimated raw score on the assessment (school grades) 
X is an actual raw score on the predictor (vocabulary) 
σ_y is the standard deviation of the assessment 
σ_x is the standard deviation of the predictor 
M_y is the mean of the raw assessment scores 
M_x is the mean of the raw predictor scores 

Although Eq. (5-6) looks complicated, it is a straightforward extension of 
Eqs. (5-4) and (5-5). Formula (5-6) might predict that a person with 
a raw vocabulary score of 20 will make a raw grade average of 4.0. 

Both Formulas (5-5) and (5-6) can be used to plot best-fit lines. 
Formula (5-5) would be used if deviation scores are plotted, and Formula 
(5-6) would be used if raw scores are plotted. The correlation 
coefficient and the three prediction equations above are the basic statistical 
procedures needed for the validation and use of predictor tests. 
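
Eq. (5-6) can be sketched the same way. The text gives no means or standard deviations for its example of a vocabulary score of 20, so the values below are assumptions: r and the two standard deviations are borrowed from the deviation-score example, and the two means are hypothetical numbers chosen only so that the arithmetic is easy to follow.

```python
def predict_raw(X, r, sd_x, sd_y, mean_x, mean_y):
    """Eq. (5-6): estimated raw score on the assessment."""
    return r * (sd_y / sd_x) * (X - mean_x) + mean_y

# r, sd_x, sd_y reuse the deviation-score example; the means are hypothetical.
estimate = predict_raw(X=20, r=0.50, sd_x=2, sd_y=1, mean_x=17, mean_y=3.25)
print(f"estimated raw grade average: {estimate}")

# A person exactly at the predictor mean is predicted to fall at the assessment mean.
at_mean = predict_raw(X=17, r=0.50, sd_x=2, sd_y=1, mean_x=17, mean_y=3.25)
print(f"estimate for a score at the mean: {at_mean}")
```

Under these assumed values the deviation score is 3, the estimated deviation score is .75 as before, and adding the assumed mean gives the raw estimate.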


The correlation coefficient is, of course, not used to predict grade aver- 
ages that are already known, but to forecast the grade averages of incom- 
ing freshmen. If subsequent groups of incoming freshmen are students 
of about the same caliber as their predecessors, the correlation between 
vocabulary and later grades will be approximately the same. Therefore, 
it is safe to use the first correlation to forecast the grade averages 
of incoming freshmen. 

The Standard Error of Estimate. Another way to understand correlational 
analysis is to think in terms of the errors of prediction. Figure 5-3 
shows a typical comparison of raw scores on a test and on an assessment. 
(Whereas in the earlier example the correlation was .78, here the correlation 
is .50.) Here we will assume that the best-fit line is known to begin 
with, as shown in Figure 5-3, and see how it is used to make predictions. 
When a bivariate relationship is pictured like that in Figure 5-3, it is 
referred to as a scatter diagram. The points are plotted in terms of intervals 
instead of discrete scores. That is, all of the vocabulary scores from 0 
through 9 are plotted in the first column, all of the scores from 10 through 
19 are plotted in the second column, and so on for all intervals. Each 
column is referred to as an “array,” and there are as many arrays as there 
are intervals on the vocabulary axis. 

Figure 5-3. Scatter diagram of vocabulary scores and school grade averages. 

In order to predict any student’s grade, simply find his score on the 
vocabulary axis and then move upward to the best-fit line. Then move 
parallel to the vocabulary axis to the predicted grade average. If a student 
makes a vocabulary score of 10, the prediction is that he will make a 
grade average of approximately 2.92. If a student makes a vocabulary 
score of 80, the prediction is that he will make a grade average of approximately 
4.12. In each case we predict as though all of the points lay on the 
best-fit line, but, in fact, the points in each array scatter above and below 
the line. This tells us that a prediction from any point on the vocabulary 
axis about how well an individual will perform in school is likely to be in 
error in proportion to the scatter of the Y points about the prediction line. 
If the scatter about the best-fit line is large, then prediction will be poor; 
as the scatter lessens, the prediction becomes better and better. 

In Chapter 8, the standard deviation was given as a measure of dispersion. 
The same kind of measure can be used in the scatter diagram to 
describe the error in predicting grades from vocabulary scores. A measure 
of the standard deviation type could be determined for any one array by 
subtracting the predicted grade average (at the point of the best-fit line) 
from each of the grade averages actually found. We could then find the 
standard deviation of the prediction errors. A like measure could be 
obtained for each of the arrays, and these could be averaged to determine 
an over-all estimate of the dispersion about the best-fit line. The method 
used in practice is slightly different from this procedure. Instead of finding 
the separate dispersions and averaging them, all of the deviates are 
determined about their respective points of prediction, and these are placed 
in the conventional standard deviation formula. The resulting value is called 
the standard error of estimate and will be symbolized as σ_est. 
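
The pooled procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary scores and grade averages are hypothetical, the function names are invented for the example, and the best-fit line is taken to be the ordinary least-squares line.

```python
import math

def best_fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def standard_error_of_estimate(xs, ys):
    """Standard deviation of actual scores about their predicted scores."""
    slope, intercept = best_fit_line(xs, ys)
    # One deviate per person, each taken about that person's own prediction.
    errors = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # The deviates from a least-squares line average zero, so the usual
    # standard deviation formula reduces to the root mean squared deviate.
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Hypothetical vocabulary scores (X) and grade averages (Y).
vocabulary = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
grades = [2.6, 3.1, 3.0, 3.4, 3.2, 3.9, 3.6, 4.1, 4.0, 4.5]

sigma_est = standard_error_of_estimate(vocabulary, grades)
print(f"standard error of estimate = {sigma_est:.3f}")
```

Because every deviate is pooled into one formula, no array needs to hold more than a single person for the computation to work.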

The standard error of estimate is an absolute measure of error in pre- 
dicting Y, but the units in which the assessment is expressed are often 
arbitrary functions of test construction practices. If, for example, the 
assessment is a course examination, the grades could range from 1 to 10, 
50 to 100, or assume any other range in accordance with the system of 
grading used. Consequently, it is necessary to compare the standard error 
of estimate with the standard deviation of the assessment before the 
efficiency of prediction can be determined. 

If the test gave no help in prediction, the best guess would be that each 
person fell at the mean of Y. These predictions would be in error in 
accordance with the standard deviation of assessment scores which is actually 
obtained. The standard deviation of Y would specify the likelihood with 
which predicted scores would fall within certain intervals on the assessment 
continuum. Then, as would be expected, when the correlation 
between the test and the assessment is zero, the standard error of estimate 
will be the same as the standard deviation of Y. As the correlation becomes 
larger and larger, σ_est becomes comparatively smaller than σ_y. The 
ratio of these two measures of dispersion leads to a very useful index of 
predictive efficiency. 

It is easy to picture what the scatter diagram will look like as the 
correlation becomes larger and larger; the area of scatter will become 
progressively more narrow. This progressive narrowing can be seen in 
Figure 5-4. 

Figure 5-4. Areas of scatter for different size correlations. (Horizontal 
axis: predictor test score, X; vertical axis: assessment score, Y.) 

The correlation coefficient concerns the squares of σ_y and σ_est. The 
squares are used because the resulting formula indicates how the best-fit 
line should be drawn, as well as indicating the degree of relationship 
between the two variables. As was previously noted, the square of any 
standard deviation type value is called the variance. The ratio of the 
two squared values is then a variance ratio, a term that will be seen 
quite often in testing theory. If this ratio were used as a measure of 
relationship, it would read “backwards”: a high ratio would indicate a 
weak relationship, and a low ratio would represent a strong relationship. 
This difficulty is overcome by subtracting the ratio from 1, resulting in 
the following formulas: 


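$$ r^2 = 1 - \frac{\sigma_{est}^2}{\sigma_y^2} \tag{5-7} $$

$$ r = \sqrt{1 - \frac{\sigma_{est}^2}{\sigma_y^2}} \tag{5-8} $$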


The letter r stands for the correlation coefficient, r² is said to be the 
“per cent of variance explained,” and 1 − r² is spoken of as the “per cent 
of variance unexplained.” It is perhaps easier to visualize the meaning of 
r² than r itself, but as was shown previously, the correlation coefficient 
is directly useful in making predictions. 

Although σ_est does not appear directly in Formulas (5-1), (5-2), or 
(5-3), it can easily be derived after the r is obtained. Formula (5-7) can 
be suitably transformed as follows: 


$$ \sigma_{est} = \sigma_y \sqrt{1 - r^2} \tag{5-9} $$
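
A short numerical sketch may make Formula (5-9) more concrete. It assumes the formula reads σ_est = σ_y √(1 − r²), and it sets σ_y to 1 so that the result is the standard error of estimate expressed as a fraction of the assessment standard deviation. The function name is invented for the illustration.

```python
import math

def std_error_of_estimate(sigma_y, r):
    """Formula (5-9): sigma_est = sigma_y * sqrt(1 - r**2)."""
    return sigma_y * math.sqrt(1.0 - r ** 2)

# With sigma_y = 1 the result is the ratio sigma_est / sigma_y.
for r in (0.0, 0.50, 0.87, 1.0):
    ratio = std_error_of_estimate(1.0, r)
    print(f"r = {r:.2f}   sigma_est / sigma_y = {ratio:.2f}")
```

The ratio shrinks slowly at first: a correlation of .50 still leaves the standard error of estimate at roughly 87 per cent of the assessment standard deviation.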


Nonlinear Relationships. We have spoken so far only about linear 
relationships. Formulas (5-1), (5-2), and (5-3) indicate the linear 
correlation between two sets of scores, and Formulas (5-4), (5-5), and (5-6) 
place the best-fitting straight lines in scatter diagrams. Formulas (5-7), 
(5-8), and (5-9) were used to illustrate the dispersion about the best- 
fitting straight line. A possibility that has not been mentioned so far is 
that the bivariate relationship might not be linear. A relationship is linear 
if the points in each array spread equally above and below the best-fit 
straight line. Technically, the relationship is linear if the mean assessment 
score for each array falls exactly on the best-fit line. 

In any particular scatter diagram, the array means can be plotted and 
compared with the best-fit straight line. It is expected that the array 
means will diverge from the line by small amounts if only because of 
sampling error. However, in the use of psychological tests it is rare to 
find that the array means show some systematic trend other than a 
straight-line function. When this happens, the relationship is said to be 
curvilinear, an example of which is shown in Figure 5-5. Figure 5-5 shows 
that assessment scores rise with test scores up to a point and 
then start to decline. Then the people who make very high scores on the 
test do less well, as a group, on the assessment than the people who make 
only moderately high test scores. Although curvilinear relationships are 
sometimes found in psychological experiments, such a marked departure 
from linearity as the one shown in Figure 5-5 is rarely found in using 
predictor tests. 

Figure 5-5. Scatter diagram illustrating a curvilinear relationship. 

Formula (5-7) was illustrated with a linear relationship, but it can be 
applied equally well to curvilinear relationships like that in Figure 5-5. 
The only difference is in the way that the standard error of estimate is 
obtained. Instead of computing the statistic about the best-fitting straight 
line, σ_est can be obtained about the array means. The mean assessment 
score in each array is subtracted from the assessment scores in that 
array. The differences are then squared and summed over all arrays. The 
resulting sum is divided by the number of individuals. This is then the 
squared standard error of estimate taken about the array means. When 
this is placed in Formula (5-8), it gives a general index of correlation 
which is independent of the form of the relationship. Formula (5-8) is a 
measure of correlation that can be used on relationships of all kinds. 
When the standard error of estimate is obtained about the best-fitting 
straight line, the resulting statistic is called the correlation coefficient and 
symbolized as r. When the standard error of estimate is obtained about 
the array means, the resulting statistic is called the correlation ratio or 
eta. 
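
The steps for the correlation ratio can be followed in a small Python sketch. The data are hypothetical and deliberately shaped like an inverted U, and the function names are invented for the illustration. The sketch takes the general index to be √(1 − σ_est²/σ_y²), with σ_est computed about the array means as described above.

```python
import math
from collections import defaultdict

def correlation_ratio(arrays, ys):
    """Eta: sqrt(1 - sigma_est**2 / sigma_y**2), sigma_est about array means.

    `arrays` gives the predictor interval (array) each person falls in.
    """
    groups = defaultdict(list)
    for label, y in zip(arrays, ys):
        groups[label].append(y)
    # Squared deviations of each score from the mean of its own array.
    ss_about_array_means = 0.0
    for scores in groups.values():
        array_mean = sum(scores) / len(scores)
        ss_about_array_means += sum((y - array_mean) ** 2 for y in scores)
    n = len(ys)
    sigma_est_sq = ss_about_array_means / n
    mean_y = sum(ys) / n
    sigma_y_sq = sum((y - mean_y) ** 2 for y in ys) / n
    return math.sqrt(1.0 - sigma_est_sq / sigma_y_sq)

def linear_r(xs, ys):
    """Ordinary correlation coefficient, for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores: assessment rises and then falls across five arrays.
array_of_person = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
assessment = [1, 2, 4, 5, 6, 7, 4, 5, 1, 2]

eta = correlation_ratio(array_of_person, assessment)
r = linear_r(array_of_person, assessment)
print(f"eta = {eta:.2f}, linear r = {r:.2f}")
```

In data of this shape the linear coefficient is close to zero while eta is high, which is exactly the case in which reporting r alone would mislead.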

It must be kept in mind that the linear correlation formulas will place 
the best-fitting straight line in the scatter diagram regardless of whether 
the relationship is actually linear. Before applying the linear correlation 
formulas, the scatter diagram should be inspected to see if the relation- 
ship is reasonably linear. If there is only a moderate curve in the rela- 
tionship, the linear formula will still do an adequate job of describing 
the trend. However, if there is apparently a marked tendency toward 
curvilinearity, a test of statistical significance should be made (see 3, pp. 
255-262). Then if the departure from linearity is statistically significant, 
either the correlation ratio should be applied or people should be told 
that the linear correlation formula was applied to a curvilinear relation- 
ship. When correlations are stated in research reports, readers will as- 
sume that the relationships are reasonably linear. Among the other as- 
sumptions which are made in employing the correlation coefficient, r, 
linearity of relationship is the most important. 

If a linear relationship is found between predictor and assessment, this 
provides at least a practical justification of the original assumption of an 
interval scale for the predictor test. A linear relationship means that 
equal intervals on the predictor correspond to equal intervals on the 
assessment. Because a linear relationship is the usual and not the 
unexpected finding in working with predictor tests, this shows that for 
practical purposes the assumption of an interval scale and the use of the 
statistics which depend on this assumption are often justified. 

In order to use σ_est as a measure of the error in making predictions, the 
bivariate relationship should have several characteristics in addition to 
linearity. Earlier in the chapter there was a discussion of the errors of 
prediction in terms of the array dispersions. A possibility which was 
not mentioned at the time is that the array dispersions are not equal. An 
example is shown in Figure 5-6. Although relationships similar to the 
one in Figure 5-6 are sometimes found in practice, unequal dispersion is 
usually considered an undesirable condition. The σ_est obtained for this 
relationship would be an underestimate of the error in predicting from high 
X scores, and an overestimate of the error in predicting from low X scores. 
The assumption in using the correlation coefficient is that the dispersion 
about the best-fit line is the same for all of the arrays on the X axis. The 
presence of equal array dispersions is referred to by the rather formidable 
name of homoscedasticity. 

Figure 5-6. Scatter diagram in which array dispersion increases with test score. 


Not only is it desirable to have equal array dispersions, it is helpful if 
the points in each array are normally distributed about the best-fit line. 
That is, the points should be most numerous where the best-fit line 
crosses the array and progressively less numerous in moving 
upward or downward from the best-fit line, in accordance with the 
frequencies expected from the normal distribution. Normality of array 
dispersions is necessary if the σ_est is used as a standard deviation type 
measure. Only if the distribution is normal will 68 per cent of the cases 
fall between plus and minus 1 standard error, and so on for all other 
segments of the distribution. 

If array dispersions are normal, the standard error of estimate can be 
used to determine the probability with which predicted assessment scores 
will deviate from actual assessment scores by particular amounts. If, for 
example, a predicted score is 40 and the σ_est is 5, the odds are only 5 
in 100 that the actual assessment score is greater than 50 or less than 30 
(remember from Chapter 3 that only 5 per cent of the cases are beyond 
plus 2 and minus 2 standard deviations of the normal distribution). If 
the array dispersions are not normal, σ_est will lead to inaccurate 
probability statements. It is fortunately the case that array dispersions are 
almost always normal. 
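
The probability statement in the example can be checked with the normal curve directly. The sketch below assumes normal array dispersions, as the paragraph requires, and uses `NormalDist` from the Python standard library; the function name is invented for the illustration.

```python
from statistics import NormalDist

def chance_of_missing_by(amount, sigma_est):
    """Probability that the actual score lies more than `amount` points
    above or below the predicted score, assuming normal array dispersions."""
    z = amount / sigma_est
    # Area in both tails beyond plus and minus z standard errors.
    return 2.0 * (1.0 - NormalDist().cdf(z))

# Predicted score 40, standard error of estimate 5:
# chance that the actual score is above 50 or below 30.
p = chance_of_missing_by(10, 5)
print(f"probability = {p:.3f}")
```

The exact two-tailed area beyond 2 standard errors is slightly under 5 per cent; the figure of 5 in 100 is the usual rounding.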

Another feature which is desirable in the correlation problem, but not 
absolutely necessary, is that each of the two variables has normal dis- 
tributions of scores. In practice it is found that highly skewed variables 
and other forms that differ markedly from the normal are often accom- 
panied by one or more of the several undesired features in the bivariate 
relationship, such as nonlinearity. 

Regression. Looking back at Eq. (5-4), it can be seen that the best 
prediction is always that an individual will make a standard score nearer 
the mean of the assessment than his standard score on the predictor— 
except in the case of a correlation of 1.00. Also, because each of the 
standard scores on the predictor is multiplied by a constant, r, the stand- 
ard deviation of the predicted scores will always be smaller than the 
standard deviation of the real assessment standard scores. Because of the 
need to make bets nearer the mean of Y than the predictor score would 
indicate, the best-fit line is often referred to as the regression line and 
Eqs. (5-4), (5-5), and (5-6) as regression equations. 
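
The regression effect can be seen in a few lines of Python. The sketch assumes, as the paragraph states, that the predicted standard score is the predictor standard score multiplied by r; the function name and the five predictor scores are hypothetical.

```python
def predicted_standard_score(z_x, r):
    """Predicted assessment standard score: the predictor score times r."""
    return r * z_x

r = 0.50
predictor_z = [-2.0, -1.0, 0.0, 1.0, 2.0]
predicted_z = [predicted_standard_score(z, r) for z in predictor_z]

for z_x, z_pred in zip(predictor_z, predicted_z):
    print(f"predictor z = {z_x:+.1f}   predicted assessment z = {z_pred:+.1f}")
```

Every prediction lies nearer the mean than the score it was made from, and the predicted scores are spread only r times as widely as the predictor scores.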

The observed tendency for predictions to regress toward the mean 
was the original stimulus for developing the correlation coefficient. For 
example, it was noted that the sons of very tall fathers tended to be 
shorter than their fathers, and that sons of very short fathers tended to 
be taller than their fathers. There is a simple reason why the regression 
tendency occurs. The people who make the highest scores on the predictor 
test can either make the highest assessment scores or they can vary 
downward toward the mean. Consequently, the array means of the 
extremes on the predictor regress toward the assessment mean. The in- 
between arrays tend to follow suit, resulting in the general tendency of 
regression toward the mean. 

The slant, or slope, of the best-fit line is evidence of the regression 
tendency. The slope indicates how many units of Y the line rises for each 
unit of X. When dealing with standard scores, the correlation 
coefficient is the slope of the best-fit line. For example, when there is 
a correlation of .50, the best-fit line rises one-half unit on the assessment 
axis for each unit on the predictor axis; that is, it moves over two units 
on the predictor axis before going up one unit on the assessment axis. 


Correlational analysis can be used to predict the X’s from the Y’s as well 
as the Y’s from the X’s. It is arbitrary which variable is labeled X and is 
placed on the horizontal axis rather than labeled Y and placed on the 
vertical axis. In the example above, the height of fathers could be pre- 
dicted from the height of sons. The correlation coefficient will be the 
same regardless of which variable is labeled X. Also, we could predict 
test scores from grade point averages, although there would be no 
practical purpose served. Correlation Formulas (5-1), (5-2), and (5-3) 
would be unchanged, and we would obtain the same coefficient of .78 
that was determined earlier. Regression, correlation, and standard error 
of estimate formulas (5-4) through (5-9) would be altered by 
substituting Y’s for X’s and Y subscripts for X subscripts, and vice versa. In 
most testing problems it is important to estimate the assessment from 
the predictor test, but not vice versa. Consequently, the assessment will 
be labeled Y and shown on the vertical axis throughout the text, and 
the regression problem will be treated in terms of the prediction of 
assessment scores. 

Linear Transformations. One important property of the correlation 
coefficient is that it is completely insensitive to linear transformations of 
the predictor and assessment scores. A linear transformation consists of 
adding a constant amount to each score, subtracting a constant amount, 
or multiplying or dividing each of the scores by a constant amount. For 
example, the correlation in Figure 5-3 would remain .50 if the number 10 
million were added to each of the scores on the predictor test and each 
of the assessment scores were multiplied by 20 billion. The reason why 
the correlation coefficient is insensitive to linear transformations is that 
it is a measure of the relationship between two sets of standard scores. 
Either standard scores are used in the correlation formula or the 
conversion to standard scores is done in the deviation score and raw score 
computational formulas. 
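
The insensitivity to linear transformations is easy to verify numerically. The following sketch uses hypothetical scores and an invented helper, `pearson_r`, which computes the coefficient from deviation scores.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictor and assessment scores.
test_scores = [12, 25, 31, 48, 52, 67, 70, 88]
assessments = [2.1, 2.9, 2.4, 3.3, 3.0, 3.8, 3.2, 4.0]

r_before = pearson_r(test_scores, assessments)

# Add a constant to every predictor score; multiply and shift every
# assessment score.
shifted = [x + 1000 for x in test_scores]
rescaled = [20 * y - 7 for y in assessments]
r_after = pearson_r(shifted, rescaled)

print(f"before: {r_before:.4f}   after: {r_after:.4f}")
```

One qualification applies: multiplying one of the variables by a negative constant leaves the size of the coefficient unchanged but reverses its sign.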


THE STATISTICAL SIGNIFICANCE OF CORRELATIONS 
correlation coefficient determined from a sample of persons mirrors the 
real correlation between variables in the whole universe of persons. 

To illustrate the influence of sampling error on correlation coefficients, 
imagine that we are trying to learn the correlation between height and 
weight for British male adults. It would be exceedingly laborious to 
measure the height and weight of every male above twenty-one years in 
England; and it would be no small amount of work to compute the 
correlation. If a research worker set out to learn the correlation between 
height and weight, he would probably work with only a sample from 
the universe. The sample might be quite large, say over ten thousand 
men, or perhaps only a few dozen men would be used instead. 


Figure 5-7. Distributions of sample correlations when the universe correlation is .00. 
(N is the sample size.) 

Even if the universe correlation is zero, the correlation found for a 
particular sample probably would not be exactly zero. We might draw a 
number of different samples of persons, say all of them with 100 persons, 
and run the separate correlations. It would be expected that these correla- 
tions would range around zero. There would be no reason to expect more 
positive than negative coefficients or vice versa. The dispersion of the 
coefficients about zero would be dependent on the N or sample size. That 
is, the correlations obtained from successive samples would crowd nearer 
to zero when the sample size is large—assuming again that the universe 
correlation is exactly zero. The distributions of coefficients that would be 
obtained for different sample sizes would look like those in Figure 5-7. 
Because the true correlation is zero in Figure 5-7, all of the non-zero 
coefficients are due to chance. On occasion these chance coefficients can 
be quite large. Notice that with an N of 10, correlations greater than 
+.50 and less than −.50 would occur appreciably often purely by chance. 

Before any effort is made to interpret a particular correlation, some 
assurance must be obtained that the coefficient is significantly different 
from zero. Otherwise the correlation may represent a chance sampling 
of people from a universe in which the true value is zero. One way to 
gain some confidence in particular correlations is to learn the odds, or 
the probability, with which the correlation could have been obtained by 
chance alone. If the probability is very low, say 1 in 100, or at the .01 
level as it will be called, then it is relatively safe to assume that the 
correlation is not merely a chance relationship. 
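
How often chance alone produces sizable coefficients can be explored by simulation. The sketch below is an illustration under stated assumptions: both variables are drawn as independent normal scores, so the universe correlation is exactly zero, and the number of samples, the seed, and all function names are arbitrary choices made for the example.

```python
import math
import random

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def sample_correlations(n, trials, rng):
    """Correlations from repeated samples of size n, universe r = 0."""
    results = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to xs
        results.append(pearson_r(xs, ys))
    return results

rng = random.Random(0)  # fixed seed so the run can be repeated
spread = {}
share_beyond_half = {}
for n in (10, 100):
    rs = sample_correlations(n, 2000, rng)
    mean_r = sum(rs) / len(rs)
    spread[n] = math.sqrt(sum((r - mean_r) ** 2 for r in rs) / len(rs))
    share_beyond_half[n] = sum(abs(r) > 0.50 for r in rs) / len(rs)
    print(f"N = {n:3d}  SD of r = {spread[n]:.3f}  "
          f"share beyond +/-.50 = {share_beyond_half[n]:.3f}")
```

With samples of 10 a noticeable share of the chance coefficients falls beyond plus or minus .50, whereas with samples of 100 such values are practically never seen.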

The Standard Error of the Correlation. In Chapter 3 it was shown how 
the normal curve can be used to determine probabilities. For example, 
it was said that only 32 per cent of the cases would be beyond plus and 
minus 1.00 standard deviations, and that only 5 per cent of the cases 
would lie beyond plus and minus 2.00 standard deviations. If it were 
known that chance distributions of correlations such as those shown in 
Figure 5-7 are normally distributed, it would be possible to determine 
the probability of getting correlations of certain sizes by chance. All that 
is required is to obtain the standard deviation of the distribution of 
coefficients. This can be obtained approximately as follows: 


$$ \sigma_r = \frac{1}{\sqrt{N - 1}} \tag{5-10} $$


The statistic σ_r is called the standard error of the correlation coefficient. 
Its use can be illustrated by returning to the correlation between the 
height and weight of men in England. If our sample consisted of 101 
persons, the standard error of the correlation would be: 

$$ \sigma_r = \frac{1}{\sqrt{101 - 1}} = \frac{1}{10} = .10 $$


This means that, if the universe correlation is zero, the standard deviation 
of correlations obtained with samples of 101 persons will be .10. Then 
68 per cent of the correlations obtained by chance would not exceed plus 
or minus .10. In other words, only 32 per cent would exceed those values. 
When a correlation is obtained for a sample of size 101, the σ_r allows us 
to determine the odds that the sample was drawn from a population which 
has a correlation of zero. 

It is customary not to accept a finding as statistically significant unless 
it is at the .05 level or beyond. Looking at the per cents of cases in 
various areas of the normal distribution in Appendix 3, it is found that 
5 per cent of the cases lie beyond plus and minus 1.96 standard deviations 
from the mean. By multiplying 1.96 times the σ_r for a particular sample 
size, the .05 level for correlation coefficients can be determined. When 
the N is 101 and, consequently, the σ_r is .10, the .05 level would be 
represented by correlations greater than +.196 and less than −.196. Because 
it is customary to use only two decimal places for correlations, the .05 
level would be set at correlations of +.20 and −.20. Then if with a 
sample size of 101 persons a correlation as large as +.20 or −.20 is found, 
it can be said that the odds are less than 5 in 100 that the sample was 
drawn from a population in which the correlation was zero. This is another 
way of saying that the particular correlation is significant at the .05 level. 
In a similar way it can be found that correlations of .26 or above and 
correlations of −.26 or below are significant at the .01 level for a sample 
size of 101 (see Appendix 3). The standard error of the correlation can 
be found for any sample size by substituting the N in Formula (5-10). 
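
The same arithmetic can be carried out for any sample size. The sketch below takes Formula (5-10) to be σ_r = 1/√(N − 1), which is what the worked example with N = 101 implies, and it relies on the normal approximation discussed in this section. The function names are invented for the illustration.

```python
import math
from statistics import NormalDist

def standard_error_of_r(n):
    """Formula (5-10): sigma_r = 1 / sqrt(N - 1)."""
    return 1.0 / math.sqrt(n - 1)

def critical_r(n, level):
    """Smallest absolute correlation significant at `level` (two-tailed),
    under the normal approximation."""
    z = NormalDist().inv_cdf(1.0 - level / 2.0)  # about 1.96 for .05
    return z * standard_error_of_r(n)

for n in (26, 101, 401):
    print(f"N = {n:3d}  sigma_r = {standard_error_of_r(n):.3f}  "
          f".05 level = {critical_r(n, 0.05):.3f}  "
          f".01 level = {critical_r(n, 0.01):.3f}")
```

For N = 101 this gives the values of about .20 and .26 quoted above. The rows for smaller samples should be read with caution, because the normal approximation is poor for small N.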

Because of an untenable assumption, the standard error of the 
correlation has only limited use. The formula assumes that when samples are 
drawn from a universe in which the correlation is zero the sample 
correlations will be normally distributed. This is not quite true for any 
size of sample, and it is a particularly bad assumption for samples with 
less than thirty cases. The σ_r offers a good approximation to the needed 
error term under two conditions: (1) when the problem is to determine 
the significance of a particular correlation from zero correlation, and 
(2) when the sample size is reasonably large, say with an N of 100 or 
more. 

Although σ_r provides a good estimate of the sampling error of 
correlation coefficients under the conditions stated above, there are more 
precise error formulas that can be used (see 1, 2, 3, 6). The standard error 
of the correlation coefficient was described in order to illustrate tests of 
significance. Rather than go more deeply into the statistical checks which 
can be applied, we will discuss the implications of significance tests for 
measurement programs. 

Before any practical decisions are made about the usefulness of a 
predictor test, it must first be determined whether or not the test is 
significantly related to the criterion variable. Unless the correlation is 
based on a very large number of cases, say over 300, there is consider- 
able sampling error. Any decisions about using a test should be tem- 
pered by the size of the sampling error as well as by the size of the 
obtained correlation. Samples of less than a hundred persons have so 
much sampling error that they do not offer firm enough evidence for 
the acceptance of tests. It is only as the sample size gets to be larger than 
several hundred persons that we can have confidence in the indicated 
validity. 

If the problem in a testing program is to decide whether or not to use 
a particular test, it must first be determined if the test-assessment 
correlation is significantly different from zero. If the correlation is not 
significantly different from zero, the predictor test should be held suspect. It 
is always possible to gather a larger sample and examine the significance 
of the new correlation. The burden should always be placed on the test 
to prove its utility. 

In some testing situations there is a minimum validity that a test must 
have in order to be usable. For example, it might be the case that the 
expense of using a test can be justified only if the test-assessment corre- 
lation proves to be greater than .30 in continued use. If on the first sample 
of test responses a correlation of .55 is found, a statistical check can be 
made to see if the correlation is significantly greater than .30. 
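
The text does not say which statistical check is intended. One common choice, assumed here and not taken from this chapter, is Fisher's r-to-z transformation, under which the transformed coefficient is approximately normal with standard error 1/√(N − 3). The sample size of 100 below is hypothetical, and the function name is invented for the illustration.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, rho0, n):
    """One-tailed test that the universe correlation exceeds rho0.

    Uses Fisher's r-to-z transformation (an assumption of this sketch).
    Returns the normal deviate and the one-tailed probability.
    """
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    p = 1.0 - NormalDist().cdf(z)
    return z, p

# Obtained correlation .55, required minimum .30, hypothetical N of 100.
z, p = fisher_z_test(0.55, 0.30, 100)
print(f"z = {z:.2f}, one-tailed probability = {p:.4f}")
```

With a much smaller sample the same pair of coefficients would give a smaller deviate, and the difference from .30 might no longer be significant.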

Another problem that will occur quite often in testing is that of decid- 
ing whether one test works better than another in predicting a particular 
assessment. For example, a vocabulary test and an arithmetic test could 
be given to fourth-grade students. It might be found later that the vocab- 
ulary test and the arithmetic test correlate .60 and .40 respectively with 
school grades. If the intention is to use only one test to assign children 
to class sections, it would have to be determined whether the apparent 
difference in predictive power could likely be due to chance. 


THE INFLUENCE OF THE DISPERSION ON CORRELATIONS 


An important property of the correlation coefficient is that its size is 
directly related to the standard deviation of the assessment. Looking back 
at Figure 5-3, what would happen to the correlation if only the people 
with grade averages above 2.67 were considered in the calculation? This 
would certainly reduce the standard deviation of grades, σ_y, but it would 
have relatively little influence on σ_est. Remember that the assumption of 
equal array dispersions requires that σ_est will be the same regardless of 
the region of predictor scores being considered. If only the people who 
score above 2.67 in grades were considered in the computations, the 
correlation would be smaller. Look back at Formula (5-7) and you will 
see that this is the only possible result. Whenever the dispersion of ability 
on the assessment is altered, the correlation changes. Be careful to note 
that the change in dispersion must be a change in sampling such that the 
range of “real ability” varies. As you remember, changing the size of the 
standard deviation by a linear transformation of the assessment scores 
will have no influence on the correlation. Whereas σ_est tends to be the 
same in different samples of persons, the correlation varies with the 
assessment dispersion. 
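
The effect of a narrowed assessment dispersion follows directly from the relation r² = 1 − σ_est²/σ_y². The sketch below holds σ_est fixed at a hypothetical value, as the assumption of equal array dispersions suggests, and lets σ_y shrink; the function name and all numbers are invented for the illustration.

```python
import math

def correlation_from_dispersions(sigma_est, sigma_y):
    """r = sqrt(1 - sigma_est**2 / sigma_y**2)."""
    return math.sqrt(1.0 - (sigma_est / sigma_y) ** 2)

sigma_est = 0.60  # hypothetical; assumed constant from sample to sample
results = {}
for sigma_y in (1.00, 0.80, 0.65):
    results[sigma_y] = correlation_from_dispersions(sigma_est, sigma_y)
    print(f"sigma_y = {sigma_y:.2f}   r = {results[sigma_y]:.2f}")
```

As σ_y approaches σ_est the correlation falls toward zero, even though the accuracy of individual predictions, measured by σ_est, has not changed at all.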

Because of the influence of the assessment dispersion on validity, the 
standard deviation of the assessment scores should be compared with that 
which is generally found in the assessment situation. If the dispersion used 
in validation work is different from the usual dispersion found, then a 
statistical correction gives an estimate of what the test validity would 
be with a more representative dispersion of assessment scores (see 5, 
pp. 169-176). 

The influence of the dispersion on test validity relates to an interesting 
and often misunderstood observation: intelligence tests become less 
successful predictors as higher educational levels are reached. The tests 
correlate around .70 with grammar school grades, around .60 with high 
school grades, around .50 with college grades, and they tend to correlate 
only slightly with the grades of students in graduate school. The reason 
for the progressive decline in validity is that the dispersion of intellectual 
ability is gradually being decreased. The less able students are dropped 
out year by year. By the time graduate school is reached, there is 
relatively little dispersion of intellectual ability. Tests can “commit suicide,” 
if, as is often the case, the tests are used at various stages to determine 
who should continue in school. Theoretically the stage could be reached 
where the test correlates zero with the assessment which it has been so 
successful in predicting over the years. This is much like the physician 
who advises his patients so successfully on how to remain well that he 
finds himself with no ailments to treat. 


THE SELECTION OF PERSONNEL 


The statistical tools which have been discussed in this chapter are all 
aimed at answering the question, “How good are particular tests?” In 
many situations, an ability test which correlates .50 with its criterion will 
be considered as showing “good prediction.” To find personality tests of 
equal predictive efficiency would be considered a mark of outstanding 
success. However, as you know by now, a correlation of .50 means that 
only 25 per cent of the variance in the assessment can be explained by 
the predictor. Another way of looking at it is that a correlation of .50 
substituted in Formula (5-9) shows that the standard error of estimate 
will be .87 as large as the standard deviation of the assessment. On the 
face of it, this seems like rather poor prediction and a discredit to psycho- 
logical tests. 

If near perfect correlation were the standard for psychological tests, 
then nearly all of them would be considered very poor. However, a test 
need not correlate highly with an assessment to perform the job that is 
needed in many situations. How this is done can be demonstrated with a 
typical personnel-selection situation. The number of people who apply for 
a job is usually larger than the number who are selected. The ratio of 
these is referred to as the selection ratio: 


$$ \text{selection ratio} = \frac{\text{number selected}}{\text{number who apply}} $$


If it were necessary to hire all of the people who applied, the selection 
ratio would be 1.00 and there would be no room for the use of psycho- 
logical tests. Of the people who are selected for a job, some will succeed 
and some will not succeed. The unsuccessful employees may be released, 
shifted to other jobs, or remain on the job at an unsatisfactory level of 
performance. The number who succeed provides another important ratio: 


$$ \text{success ratio} = \frac{\text{number who succeed}}{\text{number selected}} $$


To illustrate the importance of the selection ratio and the success ratio, 
consider a hypothetical situation in which 100 persons are selected at 
random from the group that applies for a job. Imagine that a predictor 
test has been given to the entire group that applied, which, of course, 
could not have affected the random selection of persons. Suppose that the 
selected group is placed on the job and only 54 per cent of the people 
prove to be successful enough to continue. Later it is found that the 
predictor test correlates well with job performance scores. This prediction 
situation is illustrated in Figure 5-8. 

Instead of using dots, numbers of subjects are recorded in the scatter 
diagram in Figure 5-8. Because 100 persons are shown, the numbers can 
also be interpreted as per cents. As would be expected from the mean- 
ing of the correlation coefficient, smaller and smaller per cents of fail- 
ure are found as higher scores are made on the predictor test. None 
of the people who make a score of 20 or below succeeds on the job. Of 
the people who score 50 on the predictor test, 50 per cent succeed. None 
of the people who score as high as 70 on the predictor fails on the job. 

The information in Figure 5-8 tells us that it would be foolish to go 
on selecting in a random manner, that the test can be used to improve the 
selection of employees. Just how successful the test can be in selecting the 
applicants is a function of the selection ratio. In this validation problem, 
25 per cent of the subjects made scores of 70 or higher, and all of those 
succeed on the job. Because the group was randomly selected, it might be 
expected that, on the average, an equal per cent of all new applicants 
would make scores in that range. Suppose that 100 persons are needed 
for the job exemplified in Figure 5-8. Then, in order to select a group in 
which no failures are likely to be found, the number applying would 
have to be 400. 
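
The two ratios and the arithmetic of the example can be written out as a brief sketch. The function names are invented for the illustration; the figures are those of the example (100 openings, 25 per cent of applicants scoring 70 or higher, 54 per cent succeeding under random selection).

```python
import math

def selection_ratio(number_selected, number_who_apply):
    return number_selected / number_who_apply

def success_ratio(number_who_succeed, number_selected):
    return number_who_succeed / number_selected

def applicants_needed(openings, proportion_above_cutoff):
    """Applicants required so that enough of them reach the cutoff score."""
    return math.ceil(openings / proportion_above_cutoff)

# 100 persons are needed; 25 per cent of applicants score 70 or higher.
needed = applicants_needed(100, 0.25)
print(f"applicants needed: {needed}")
print(f"selection ratio: {selection_ratio(100, needed):.2f}")

# Success ratio under random selection in the validation group.
print(f"success ratio without the test: {success_ratio(54, 100):.2f}")
```

The stricter the cutoff, the smaller the proportion of applicants who reach it and the larger the pool that must be recruited to fill the same number of openings.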

Figure 5-8. Relationship of predictor test scores to job success. 

Usually it is not possible to find a combination of the proper correlation, 
selection ratio, and success ratio to obtain persons who all will succeed. 
But it is possible, often with only moderate-sized correlations, to choose a 
substantial preponderance of successes (see 4). 

If the success ratio is large, there is no great need for an improvement 
in selection procedures. The extent to which the test raises the success 
ratio is the measure of the test’s utility in a particular situation. If the 
selection ratio is large, there is nothing that a predictor test can do to 
improve selection. One obvious way to improve the selection situa- 
tion is to obtain as many applicants as possible for the job, assuming that 
the new groups of applicants are of the same ability levels as the ones 
who have been applying. In many situations, an improvement in successes 
of only 10 per cent will save thousands of dollars ordinarily wasted in 
training persons who will not succeed, as well as sparing the persons con- 
cerned the pains of failure. 

In some selection situations, the success ratio tends to remain the same 
after predictor tests have been brought into use. This is often the case in 
school programs, where a certain per cent of the students are failed out 
as a matter of custom. If a valid predictor test is used in selection, either
the number of failures decreases or the standards for success shift
upwards.

Tests are used as often in vocational guidance as in selection. In
vocational guidance there is no selection ratio to consider. Some good advice
must be given to each individual. For example, a student might consider
going into engineering on the basis of high scores on certain ability tests
and the results of an interest inventory. When it is necessary to make a
decision about each person, not just select the special group who score
very high on a predictor test, the error in prediction is considerably larger.
Even if the predictor-assessment correlation is as high as .87, the standard
error of estimate is one-half the size of the assessment standard deviation.
Consequently, when it is necessary to advise separate individuals, much
more confidence can usually be felt when dealing with persons from either
extreme of the test continuum. Persons who make very low scores can
safely be dissuaded from entering the particular vocation or school
program, and persons with very high scores can safely be encouraged to
follow their ambitions.
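
The statement about a correlation of .87 can be checked from the standard error of estimate, which is the assessment standard deviation multiplied by the square root of one minus the squared correlation. The short Python sketch below is only an illustration; the function name `error_fraction` is an assumption made for this example.

```python
import math

def error_fraction(r):
    """Standard error of estimate as a fraction of the assessment SD."""
    return math.sqrt(1.0 - r * r)

# Compare a few validity coefficients, ending with the .87 used in the text.
for r in (0.30, 0.50, 0.87):
    print(f"r = {r:.2f}: error is {error_fraction(r):.2f} of the assessment SD")
```

Even a very high correlation leaves about half of the original spread as error when a decision must be made about every individual.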

Looking back at Figure 5-8, it is obvious that persons who score no
higher than 30 on the predictor test would be well advised not to apply
for that job; and persons who score as high as 60 could safely be advised
to apply. The error in advising grows as we get into the “indifferent zone”
of scores between 30 and 60. If a person has a score of 50, giving advice
on the basis of the test alone would be no better than flipping a coin.

The region of maximum error in making decisions is not always in the
middle of the predictor test continuum. If the job is very simple and,
consequently, most persons succeed, the “indifferent zone” lies in the
lower-score section of the predictor. If the job is so complex that only a
minority succeed, the “indifferent zone” will lie in the upper-score regions. The
farther people are on the predictor test continuum in either direction from
the point of complete “indifference,” where 50 per cent succeed and 50 per
cent fail, the safer it is to advise them or make decisions about them.

CHAPTER 6


Reliability of Measurements 


Reliability concerns the precision of measurement regardless of what
is measured. Some random error is involved in all scientific measurements.
For example, a metal ruler expands and contracts with changing
temperatures. This introduces some error into measurements and, consequently,
lowers the precision of measurements made with this ruler. Measurement
error of this kind tends to obscure scientific lawfulness. For example, if
there is a close relationship between electrical resistance and length of a
wire, the relationship would be somewhat obscured by measuring wire
with the unstable metal ruler. This is why precise measurements can be
made only by holding temperature constant or by using a measuring
instrument that does not vary with temperature. Psychological measures
also contain a portion of random error analogous to that of the metal
ruler. In this chapter we will discuss some of the ways of detecting and
eliminating measurement error from tests.

A careful distinction should be made between test validity and test 
reliability. A test can have high reliability and not be valid for any par- 
ticular purpose. For example, we might use the weight of individuals to 
predict college grades. Whereas weight may be measured very precisely, 
and thus be highly reliable, weight would be invalid as a predictor of 
college grades. As will be shown in the pages ahead, the opposite is not 
true. In order for a test to be highly valid, it must be highly reliable also. 
High reliability is a necessary but not sufficient condition for high 
validity. 

Measurement Error and Predictive Efficiency. Figure 6-1 shows a hypo- 
thetical relationship between a predictor test and an assessment. As would 
be the case only in hypothetical circumstances, the test predicts the 
assessment perfectly, and consequently, all of the scores lie on a straight 
line. Let us see what happens when some error is introduced into the test 
scores. What we will do is to flip a coin for each score in turn; if it turns
up “heads,” we will add three points to the score; and if it is “tails,” the
score will be left as it is. Flipping the coin for each of the test scores in
turn, we find the following:


Original score    Coin flip       New score
       0              H               3
       1              T               1
       2              H               5
       3              H               6
       4              H               7
       5              T               5
       6              T               6
       7              H              10
       8              H              11
       9              H              12
      10              T              10
      11         [illegible]     [illegible]
      12              H              15
      13              T              13

Figure 6-1. Relationship between a predictor test and an assessment before measurement error is added.


In Figure 6-2 the new scores are plotted, with the included error
component, against the assessment. The assessment scores have, of course,
not been changed by the random additions to the predictor test scores. In
Figure 6-2 it can be seen that there is no longer a perfect correlation. The
effect of adding error to the scores is to lower the correlation. Starting off
with a perfect relationship, there is no way for the relationship to change
except to a lower correlation. The important point is that the addition of
random error will tend to lower the correlation no matter what it is
originally. If we had pictured a scatter plot with a correlation of .50, the
random addition of score points would have tended to make the correlation
nearer zero. Likewise, if we had a correlation of -.50, the random
addition of score points would have tended to make the correlation
nearer zero.

Figure 6-2. Relationship between a predictor test and an assessment after measurement error is added.

Random error works to make an observed relationship less than it would
be if the randomness were not present. This is why we speak of error, or
unreliability, as attenuating, i.e., lessening, a correlation. With as few
scores as those in the example above, it is possible that the random
changes in scores would either leave a correlation unchanged or could, in
a very rare circumstance, make the correlation larger. However, the odds
are against anything but a lowering of the correlation; and if the number
of subjects is as large as it is in most testing situations (over 100 persons),
it can be predicted not only that the correlation will be lower, but with
fair accuracy, just how much it will be altered.
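
The coin-flip demonstration can be imitated on a computer. The sketch below is a rough illustration, not a reproduction of the figures: the assessment scores are stand-ins chosen so that the starting relationship is perfect, the simulated flips come from a seeded random generator rather than from the flips tabled in the text, and the helper names `pearson_r` and `add_coin_flip_error` are assumptions made for this example.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def add_coin_flip_error(scores, rng, bonus=3):
    """Add `bonus` points to each score whose simulated coin lands heads."""
    return [s + bonus if rng.random() < 0.5 else s for s in scores]

rng = random.Random(0)           # fixed seed so the run is repeatable
assessment = list(range(14))     # stand-in assessment scores (an assumption)
test_scores = list(range(14))    # predictor scores 0 to 13, perfectly related
noisy_scores = add_coin_flip_error(test_scores, rng)

r_before = pearson_r(test_scores, assessment)
r_after = pearson_r(noisy_scores, assessment)
print(f"before error: r = {r_before:.3f}")
print(f"after error:  r = {r_after:.3f}")
```

Changing the seed gives a different set of flips, but as long as some scores receive the three points and others do not, the second correlation falls below the first.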

A rule can be stated which will help us in discussing test reliability: 
If a subgroup of individuals is randomly chosen from a larger group, any 
differential treatment of the subgroup which tends to make the members 
all score higher or all score lower than if the differential treatment had not 
been applied will tend to lower the correlation between the scores of the 
total group and any other variable. There are many ways in which test 
scores can be randomly altered in this way. Suppose that you are develop- 
ing a test to predict school grades and that the whole group which you 
intend to test contains several hundred students. If some of the students 
are tested when they are tired, say after a long train trip, and others are 
tested after a good night’s rest, the scores of the “tired” group will tend 
to be lower. If some of the students are tested in a quiet, well-lighted
room and others are tested in a gloomy room adjacent to noisy machinery,
the grades of the students in the less comfortable situation will tend to be 
lower. If the test instructions are given to one subgroup in a complete and 
clear manner, and if the test instructions are given to another subgroup in 
an incomplete and confusing manner, the grades of the latter subgroup 
will tend to be lower. If in each of these circumstances it can be thought 
that the students fall into the subgroups on a near random basis, the 
resulting influences on test scores will cause unreliability and will lower 
the ability of the test to predict school grades. 


THE THEORY OF MEASUREMENT ERROR 


It is easy to introduce error into test scores purposely, as was done in 
the example above, and to determine the effect it will have. However, in
nearly all practical situations, error is not purposely introduced into scores, 
and the problem instead is to try to determine the amount of error which 
has occurred in spite of the tester’s best efforts. 

Samples of Tests. In a number of places so far in this book, we have 
had to think of the content of a test, particularly of an assessment, as 
being a sample from a larger collection, or universe, from which items 
could conceivably be drawn. It was said that a sampling logic for test 
content is not as easy to maintain as is a sampling logic for universes, or 
populations, of persons. However, here again it will be necessary to
discuss content sampling to provide measures of reliability.

Any test can be thought of as one test in a universe of possible tests. If
an achievement examination is being constructed, say to measure progress
in spelling, many different tests of the same length could conceivably
be made. Similarly with a predictor test, the individual form can be
thought of as only one from a universe of tests which could be composed.
Whether the test is an assessment or a predictor, the purpose is to
estimate the score that the individual would make if he were given the entire
universe of tests. If the score that an individual makes on one sample test
is markedly different from that which he would make on another sample
test, this would mean that either one or both tests are unreliable.

The Estimation of “Universe” Scores. The problem is to estimate an
individual’s score on a hypothetical universe of tests from the score which
he makes on a sample test. (To keep in step with accepted terminology,
universe scores will be referred to as “true” scores.) The most accurate
way to estimate a “true” score would be to give the individual a large
number of tests, say, 100 spelling tests over a period of several months,
and then average the scores. It is expected that the errors would average
out over time and over the different situations in which tests are
administered. The individual who is lucky on one day and manages to guess
some of the items would probably be unlucky on another day. The
individual who is in either a poor mood or poor physical condition on one day
would probably be more fit on another day. Similarly for the other sources
of error variance, the errors would tend to average out. Consequently, the
average score over the 100 test administrations would tend to correspond
closely to an individual's “true” score.
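
The averaging argument can be illustrated with a small simulation. The sketch below rests on assumptions that are not in the text: a hypothetical “true” score of 70, random errors drawn from a normal distribution with a standard deviation of 5, and a helper named `average_of_administrations`.

```python
import random
import statistics

def average_of_administrations(true_score, error_sd, n_tests, rng):
    """Mean of n_tests obtained scores, each the true score plus random error."""
    obtained = [true_score + rng.gauss(0.0, error_sd) for _ in range(n_tests)]
    return statistics.mean(obtained)

rng = random.Random(1)               # fixed seed so the run is repeatable
true_score, error_sd = 70.0, 5.0     # hypothetical values for illustration
single = average_of_administrations(true_score, error_sd, 1, rng)
average = average_of_administrations(true_score, error_sd, 100, rng)
print(f"one test: {single:.1f}; mean of 100 tests: {average:.1f}")
```

Under these assumptions the mean of 100 administrations varies about one-tenth as much as a single score, which is the sense in which the errors average out.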

Although measurement error would be considerably reduced if a large
number of similar test forms could be used, this would be a thoroughly
impractical procedure. In order to obtain estimates of “true” scores
without this labor, some assumptions must be made. If sample tests all
containing the same number of items were administered to a group of
persons, the assumptions would be these:

1. All sample tests have the same standard deviation.

2. All sample tests correlate the same with one another (it does not
matter what the correlation is).

3. All sample tests correlate the same with the universe scores (it does
not matter what the correlation is).

These three assumptions are all that are required to deduce
mathematically the formulas for determining the influence of measurement
errors on test scores. This mathematical development is given in
Appendix 4 for the reader who is interested.

The Reliability Coefficient. A useful measure, if it could be obtained
directly, would be the correlation of the scores from a sample test with
the universe or “true” scores. This would be symbolized as

\[ r_{1u} \]

where r stands for the correlation coefficient
      subscript 1 stands for test 1
      subscript u stands for universe or “true” scores

If the assumptions above are valid, then $r_{1u}$ can be determined from the
correlation of any two of the sample tests. (Each of the sample tests could
be labeled $1_a$, $1_b$, $1_c$, and so on. The correlation between any two of
these tests will be symbolized as $r_{11}$, and we will keep it in mind that the
letters a and b are omitted to simplify subscripts.) Then it is proved in
Appendix 4 that

\[ r_{1u} = \sqrt{r_{11}} \tag{6-1} \]

The correlation of two sample tests is called the reliability coefficient.
Because its square root is an estimate of the correlation of either of the
sample tests with the “true” scores, the reliability coefficient is said to be
the per cent of non-error variance (this would be the same as the square
of $r_{1u}$, and you will remember that the square of a correlation coefficient
has meaning as a per cent of “variance explained”).

The exactness of Formula (6-1) hinges on how well the assumptions
above hold in particular testing situations; and because we are sure that
they will not hold precisely, we must question the accuracy with which
$r_{1u}$ is estimated in practical circumstances. First we will go on to develop
measurement error statistics and return in the latter part of this chapter
to consider the practical problems in the estimation of reliability.

The scores that people actually make on a test can be used to estimate
their universe or “true” score as follows:

\[ x_u = r_{11} x_1 \tag{6-2} \]

where $x_u$ is the individual's estimated “true” deviation score
      $x_1$ is the individual's deviation score on test 1
      $r_{11}$ is the correlation between any two sample tests from a content
      universe

If an individual has a test deviation score of 10 and the reliability of the
test is .90, the best estimate of his “true” score is 9. If another individual
has a test deviation score of -20, his estimated “true” score is -18. The
best estimate of the standard deviation of “true” scores is then

\[ \sigma_u = \sigma_1 \sqrt{r_{11}} \tag{6-3} \]

where $\sigma_1$ is the standard deviation of test 1
      $r_{11}$ is the reliability coefficient
      $\sigma_u$ is the estimated standard deviation of “true” scores


Unless the test is perfectly reliable, the standard deviation of “true” scores
will be less than the standard deviation of the sample, or obtained, scores.
“True” scores are of more theoretical than practical interest. People will
retain their same relative score positions in respect to one another on the
“true” score continuum as they have on the test actually being used.
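
Formulas (6-1) through (6-3) can be gathered into a few lines of Python. This is a minimal sketch using the chapter's example of a test with a standard deviation of 10 and a reliability of .90; the function name `true_score_summary` and the dictionary keys are assumptions made for illustration.

```python
import math

def true_score_summary(deviation_score, sd, reliability):
    """Apply Formulas (6-1) to (6-3) for one person on one test."""
    return {
        "r_1u": math.sqrt(reliability),            # (6-1) test with "true" scores
        "x_u": reliability * deviation_score,      # (6-2) estimated "true" score
        "sigma_u": sd * math.sqrt(reliability),    # (6-3) SD of "true" scores
    }

# The two persons in the text, with deviation scores of 10 and -20.
for x1 in (10, -20):
    s = true_score_summary(x1, sd=10, reliability=0.90)
    print(x1, round(s["x_u"], 2), round(s["sigma_u"], 2), round(s["r_1u"], 3))
```

Every estimated “true” score is the obtained deviation score shrunk toward the mean by the same factor, so the rank order of persons is unchanged.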
Standard Error of Measurement. Formula (6-2) provides only an estimate
of “true” scores. The precision of the estimate is dependent on how
well the reliability coefficient is determined. Some methods of estimating
the reliability coefficient will be discussed in the following sections. Even
if the reliability coefficient is determined in the most rigorous manner,
there is still an element of error in the estimation of “true” scores.
Looking back at Formula (6-2), it can be seen that “true” scores are
estimated from the reliability coefficient, which is a special kind of
correlation coefficient. In Chapter 5 it was shown that the correlation
coefficient not only provides a way of estimating one set of scores from
another but also indicates the amount of error involved in the estimation:
the standard error of estimate. The reliability coefficient can be used in a
similar way to estimate the error dispersion of sample scores about “true”
scores:


\[ \sigma_{meas} = \sigma_1 \sqrt{1 - r_{11}} \tag{6-4} \]

where $\sigma_{meas}$ is the standard error of measurement
      $\sigma_1$ is the standard deviation of test 1
      $r_{11}$ is the reliability coefficient


Using the example above of a test with a standard deviation of 10 and
a reliability coefficient of .90, the standard error of measurement would
be obtained as follows:

\[
\begin{aligned}
\sigma_{meas} &= 10\sqrt{1 - .90} \\
              &= 10\sqrt{.10} \\
              &= 10 \times .316 \\
              &= 3.16
\end{aligned}
\]

It must be kept in mind that the standard error of measurement ranges
about the estimated “true” score for an individual, not, as is often
mistakenly assumed, about the obtained, or sample, score.

The score that a person makes on a test should never be taken as an
exact point. First, it is necessary to recognize that measurement error
tends to push scores out in both directions from the mean. High scores
are usually overestimates of ability and low scores are underestimates.
Second, measurement error introduces a zone of uncertainty about each
estimated “true” score, the size of which is indicated by the standard
error of measurement. Consequently, an individual's score should be
considered as lying somewhere in a band along the score continuum. The
width of the band which is used depends on the precision with which
scores will be interpreted. If we want the odds to be only 1 in 100 that the
individual’s score falls outside the band, the individual
can be considered to lie in a band stretching 2.56 standard errors of
measurement (see Appendix 3) above and below his estimated “true”
score.

The use of the standard error of measurement to set confidence
intervals can be illustrated with the example above in which “true” scores for
two persons are estimated to be 9 and -18 respectively and the standard
error of measurement is 3.16. The .01 confidence level would then lie 8.09
(2.56 times 3.16) score units above and below each estimated “true”
score. For a person with an obtained deviation score of 10 and an
estimated “true” score of 9, the confidence band would stretch from .91 to
17.09. In other words, if the individual were given 100 sample tests which
met the assumptions given earlier, we would expect to find only 1 of the
100 scores either less than .91 or greater than 17.09. Because most test
scores are reported as integers, the confidence zone limits could be
rounded to the nearest whole numbers of 1 and 17. The .01 confidence
band for a person with an obtained score of -20 and an estimated “true”
score of -18 would extend from -26.09 to -9.91.

The measurement error confidence band is often indicated by stating
the individual's score as so much plus or minus 2.56 standard errors of
measurement. For example, an IQ might be reported as 120 plus or
minus 10. However, a common mistake is to space the confidence zone
symmetrically about the obtained score rather than about the estimated
“true” score. The correct procedures are either to space the confidence
zone symmetrically about the estimated “true” score or asymmetrically
about the obtained score. Illustrating the former, the scores in the
example above could be phrased as 9 plus or minus 8.09 and -18 plus or
minus 8.09. The latter approach would be to say that the scores are
10 plus 7.09 or minus 9.09 and -20 plus 10.09 or minus 6.09. Only if an
individual’s obtained score is exactly at the mean and, consequently, the
deviation score is zero, would the confidence zone be symmetrical. Then
with a standard error of measurement of 3.16 in the example above, the
score would be stated as 0 plus or minus 8.09.
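
The procedure for setting a band can be sketched in Python. The sketch assumes the chapter's example values (a standard deviation of 10 and a reliability of .90) and the multiplier of 2.56 used in the text; the function names `sem` and `confidence_band` are hypothetical and serve only this illustration.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement, Formula (6-4)."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_band(deviation_score, sd, reliability, z=2.56):
    """Band centred on the estimated "true" score, not the obtained score."""
    centre = reliability * deviation_score      # Formula (6-2)
    half_width = z * sem(sd, reliability)
    return centre - half_width, centre + half_width

# Obtained deviation scores above, below, and exactly at the mean.
for x1 in (10, -20, 0):
    low, high = confidence_band(x1, sd=10, reliability=0.90)
    print(f"obtained {x1:>4}: band {low:.2f} to {high:.2f}")
```

Because the band is centred on the shrunken estimate, it reaches farther toward the mean than away from it for every obtained score except zero.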

The Correction for Attenuation. It was said earlier that unreliability
tends to attenuate, or lower, the correlation of a test with any other
variable. The theory of measurement error (see Appendix 4) allows us to
estimate how much the unreliability influences the predictive power of a
test. For example, if we have correlated the scores on a test with a set of
assessment scores and if we have made an estimate of the test reliability,
a prediction can be made as to what the test-assessment correlation would
be if the predictor were made perfectly reliable:

\[ r_{u2} = \frac{r_{12}}{\sqrt{r_{11}}} \tag{6-5} \]

where $r_{11}$ is the reliability estimate for test 1
      $r_{12}$ is the correlation of test 1 with the assessment, 2
      $r_{u2}$ is the estimate of how much 1 would correlate with 2 if the
      test were made perfectly reliable


Formula (6-5) can be illustrated in the situation where a test correlates
.50 with an assessment and the test reliability is estimated as being .81:

\[ r_{u2} = \frac{.50}{\sqrt{.81}} = \frac{.50}{.90} = .56 \]

We would expect then that as the test is made more and more reliable its
predictive validity will move toward .56.

The correction for attenuation is often said to estimate the “true” rela- 
tionship between a test variable and an assessment; however, because of 
the assumptions which must be made, such estimates should always be 
taken with a “grain of salt.” Where the correction for attenuation is help- 
ful is in indicating the improvement in predictive validity that might 
follow from an improvement in test reliability. In the early stages of a 
testing program, it is often necessary to compose short and relatively 
unreliable tests. The correction for attenuation provides a helpful
suggestion as to how well particular tests will work if they are lengthened and
more highly standardized. 

Whereas it is important to consider the measurement error in predictor 
tests, it is equally important to consider the measurement error in the 
assessments they are meant to forecast. In many situations, the assessment 
is less reliable than the predictor test. This is particularly so when the 
assessment consists of impressionistic ratings by foremen and supervisors. 
The reliability of the assessment can be used in the correction for attenu- 
ation as follows: 


\[ r_{1u} = \frac{r_{12}}{\sqrt{r_{22}}} \tag{6-6} \]

where $r_{1u}$ is the correlation of a test with a completely reliable
      assessment
      $r_{12}$ is the correlation of a test with obtained assessment scores
      $r_{22}$ is the reliability of the assessment


If a predictor test correlates .32 with an assessment, and the reliability of 
the assessment is .64, the correction for attenuation would be as follows: 


\[ r_{1u} = \frac{.32}{\sqrt{.64}} = \frac{.32}{.80} = .40 \]


Whereas the correction for attenuation due to the unreliability of the 
predictor test is only an indication of the increased validity that might 
come from improving the predictor, the correction due to the unreliability 
of the assessment indicates how well the test “really” works at present. Pre- 
dictor tests often work much better than the test-assessment correlations 
indicate. If, as is often the case, assessments are relatively unreliable, the 
predictors actually do a much better job than the correlations show. It is
then quite sensible to correct for attenuation due to the assessment
unreliability, as shown in Formula (6-6), to estimate the “real” test validity.

Finally, a double correction for attenuation can be made which con- 
siders the unreliability in both the predictor and the assessment: 


\[ r_{uu} = \frac{r_{12}}{\sqrt{r_{11}}\,\sqrt{r_{22}}} \tag{6-7} \]

where $r_{uu}$ is the correlation between a perfectly reliable predictor and a
      perfectly reliable assessment, and other symbols retain the
      meanings given above

For example, if a test correlates .36 with an assessment, and if the
reliabilities of the predictor and the assessment are .81 and .64 respectively,
the correction would be as follows:

\[ r_{uu} = \frac{.36}{\sqrt{.81}\,\sqrt{.64}} = \frac{.36}{(.90)(.80)} = .50 \]


The correction for attenuation can be used to estimate not only what a
correlation would be if the test were made perfectly reliable but also
how much the predictive validity will rise if the reliability is raised
by any particular amount (see Appendix 9).
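
The three corrections differ only in which reliabilities appear under the square root, so they can be written as one function. The Python sketch below is an illustration under that reading of Formulas (6-5), (6-6), and (6-7); the function name `correct_for_attenuation` and its keyword arguments are assumptions, and a reliability left at 1.0 simply means that source of error is not being corrected.

```python
import math

def correct_for_attenuation(r12, r11=1.0, r22=1.0):
    """Estimated correlation after removing unreliability.

    Pass r11 alone for Formula (6-5), r22 alone for (6-6), both for (6-7).
    """
    return r12 / (math.sqrt(r11) * math.sqrt(r22))

print(round(correct_for_attenuation(0.50, r11=0.81), 2))            # (6-5)
print(round(correct_for_attenuation(0.32, r22=0.64), 2))            # (6-6)
print(round(correct_for_attenuation(0.36, r11=0.81, r22=0.64), 2))  # (6-7)
```

The three calls use the same correlations and reliabilities as the worked examples in this section.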


SOURCES OF UNRELIABILITY 


It was stated that any random influence on test scores will cause unreli- 
ability. There are many ways in which this can happen in practice. Some 
of the most prominent sources of unreliability are discussed in the follow- 
ing sections. 

Poorly Standardized Instructions. If parts of the test instructions are 
given orally by the tester, there is always the possibility that testers will 
vary the instructions to an extent which will introduce measurement error 
into the scores. The tester who gives inadequate instructions, failing, for 
example, to mention that the subjects can go back later to questions which 
are left blank initially, will penalize his group. If the tester is too lenient 
in setting the time limit for a test, the group will make higher scores than
would have been the case with a strict adherence to time. If the test
instructions vary in any way from group to group, the results of the test 
will tend to be unreliable. 

When test instructions are not written out as part of the examination 
booklet, a standard set of instructions should be made for all testers to 
read to their groups. In addition, all testers should adhere strictly to time 
limits and other standardized procedures. 

Errors in Scoring Tests. On multiple choice tests, the errors in scoring
are purely mechanical. If the test is scored “by hand,” it is possible to
accidentally score some “correct” answers as “incorrect” and vice versa.
Also, errors can be made in counting up the number of “correct” and
“incorrect” answers for each person. If tests are machine scored, an
improperly functioning machine can add a considerable amount of
measurement error to test scores.

The errors in scoring an “essay” examination are more subtle in charac- 
ter, but they also make for unreliability. One of the prime sources of 
unreliability in scoring essay examinations comes from the different grad- 
ing standards used by different instructors. If three instructors teach dif- 
ferent sections of the same course, they will probably have at least slightly 
different ideas about what kinds of answers merit good and bad marks. If 
each instructor grades his own examinations, the grade that a person gets
is partly dependent on the instructor he happens to have. 

Measurement error is also “built in” the individual instructor. If an 
instructor regrades a set of essay examinations after a period of time, 
and there are no identifying marks to indicate what the earlier grades
were, the grades will be somewhat different the second time. Another
source of measurement error is that the instructor often changes his
standards as he goes through a set of essay examinations. He might have
strict standards at first; but as he goes through the papers and sees that
none of them measure up, he is likely to become more lenient. Conse- 
quently, the grade that a student obtains is dependent on the chance 
appearance of his test in the order in which they are graded. Some ways 
of improving the reliability of essay examinations will be discussed in 
Chapter 8. 

Errors Due to Testing Environment. An effort should be made to test 
subjects under uniform conditions. Carried to extremes this rule becomes 
impossible to follow. However, gross differences in testing environment 
can be removed, such as testing one group in a poorly lighted, noisy room 
and testing others in more comfortable surroundings. 

Errors Due to “Guessing.” In Chapter 3, we discussed the effect of 
guessing on multiple choice tests and the inequities in scoring which can 
result from a failure to make corrections for guessing. The correction 
formula presented there is true only on the average. It is not capable
of specifying exactly how much an individual knows. Some individuals
may be “lucky” enough to guess considerably more than the
expected number of items, and other individuals may guess fewer correct
than would be expected on the average. Therefore, guessing on multiple
choice tests adds a certain portion of unreliability to test scores. Also,
because the individuals who know the least do the most guessing (assuming
that everyone tries every item), guessing tends to make lower scores
less reliable than higher scores.

A free-response test, one in which the subject supplies the correct
answer, is theoretically more reliable than a multiple choice form.
However, there are few subject matters which can be cast in the free-response
form, the requirement being that there be only one word or term which
represents the correct answer. The point to consider is that multiple
choice tests become more reliable as the number of alternative answers
is increased. If the subject matter permits, a test will be more reliable if
five or six alternatives are used for each item instead of only two
or three.
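
One way to see why more alternatives help is to look at how much “luck” blind guessing can contribute. The sketch below is an illustration under assumptions that are not in the text: a subject guesses blindly on 20 items, the items are independent, and all alternatives are equally attractive, so the number guessed correctly follows a binomial distribution. The function name `guessing_luck` is hypothetical.

```python
import math

def guessing_luck(n_guessed, n_alternatives):
    """Mean and SD of items answered correctly by blind guessing alone."""
    p = 1.0 / n_alternatives
    mean = n_guessed * p
    sd = math.sqrt(n_guessed * p * (1.0 - p))
    return mean, sd

# Compare two, three, five, and six alternatives per item.
for k in (2, 3, 5, 6):
    mean, sd = guessing_luck(20, k)
    print(f"{k} alternatives: expect {mean:.1f} lucky items, SD {sd:.2f}")
```

Under these assumptions both the expected number of lucky items and the spread around it shrink as alternatives are added, so less of the score is left to chance.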

Errors Due to the Sampling of Content. If the object is to estimate an 
individual's score on a universe of content, the sampling of content for a 
particular test will introduce some unreliability. Every student has had 
the feeling that he was “lucky” on a particular test, that of the many 
topics in the course, the instructor happened to ask about things which 
he knew. Another student, with the same level of understanding about the
whole course material, may have been “unlucky” in that he happened not 
to know about the particular questions. 

As in all sampling problems, the more items there are, the less unreli- 
ability will come from the sampling of items. Generally speaking, longer 
tests are more reliable than shorter tests. In order to reduce the unreli- 
ability due to sampling error, as many items as possible should be used 
(and also the items should range broadly over the subject matter). How- 
ever, this is somewhat in opposition to reducing the unreliability due to 
guessing by making as many alternatives for each item as possible. Due to 
the difficulties in composing tests, and the limited amount of time avail- 
able for testing, some compromise must be reached between the need to 
have both many items and many alternative answers for each item. That 
is why many standard tests have sought a compromise solution in 
using about four or five alternatives, with as many items as time will 
permit. 

Errors in the Individual. A large portion of the measurement error in 
most tests is due to “chance” fluctuations in the individual which raise or 
lower his score. A fly lighting on the student’s test, a broken pencil, a 
momentary distraction, a concern about the next school dance, a mistake 
in marking the answer sheet, and many more such “chance” events can
influence test scores. Although such “chance” influences tend to average
out over a long, well-standardized test, there always remains a component 
of unreliability due to errors in the individual. 

Errors Due to Instability of Scores. In most testing situations, an
individual’s score is expected not only to represent his immediate standing
but his standing for some time to come. The stability which is expected 
of scores is relative to the type of test. If students are given the same 
examination or a similar examination several weeks after the first testing, 
it is expected that the two sets of scores will correlate highly. Scores on 
intelligence tests are expected to remain relatively stable over long 
periods of time. If a child’s IQ goes up or down by as much as 10 points 
over a period of two years, it makes the test very difficult to use. On the 
other hand, there are measures which are expected to change in relatively 
short periods of time. If an experiment is being conducted to alter the 
attitudes of a group of students towards some political issue, it is expected 
that a test of attitudes given after the experiment will show differences 
from a test given before the experiment. Whenever a psychological process
is being studied or when individuals participate in an experiment, it is
expected that changes of some kind will occur. If measures remain abso- 
lutely stable during the study, then there are no research findings to 
report, except to say that nothing changed. 

Instability of test scores acts much like the other sources of error to 
reduce the reliability. If intelligence test scores obtained in the fourth 
grade are used to place children in special classes in the eighth grade, 
and if the scores have fluctuated markedly during that time, the test 
scores will do a poor job of classifying the children. If a child is being 
placed in a special class because the earlier testing showed an IQ of 125 
and if a test given four years later shows an actual IQ of only 110, the 
original test score would give a faulty picture of the child’s ability. It is 
theoretically possible that earlier scores will be more predictive than 
scores obtained at the time when decisions are made about people, but 
this is almost never found in practice. The safest assumption is that if 
scores are used over a period of time as an indication of an individual's 
standing, and the scores are found to be unstable, the instability makes
for unreliability.


ESTIMATING THE RELIABILITY COEFFICIENT 


What has been said so far about the sources of measurement error and 
the influence of measurement error on test scores depends on the theory 
of reliability, the conditions of which were given earlier. However, the 
theory is an idealization of the practical circumstances in which tests are
used, and consequently, we can only estimate the reliability coefficient
rather than determine it directly. There are a number of ways of estimating
the reliability coefficient, each of which has its own advantages and 
disadvantages. 

Retest Method. The simplest approach to determining the reliability 
coefficient is to give the same test on two occasions. The correlation 
between the two sets of scores will be an estimate of the reliability
coefficient, r11, as it was denoted earlier. 

There are two important disadvantages in using the retest method. The 
first is that the obtained reliability coefficient will reflect little of the 
error due to the sampling of content. The second is that the individual’s 
memory of his answers to the first test administration is quite likely to 
influence the answers which he gives on the second test administration. 
Such influences should dwindle as the time between the two test admin- 
istrations is lengthened, and therefore, it is wise to wait at least several 
weeks before retesting. However, it is possible that memory will affect 
the second administration to some extent even if there are several years
between testings. Memory works to make the two sets of test scores corre- 
late highly, and consequently, the reliability coefficient is usually an 
overestimate when determined by the retest method. 

Equivalent-form Method. Instead of using the same test on two occa- 
sions, it is better to use two “equivalent” forms (also called “comparable” 
forms and “parallel” forms). That is, instead of making up only one test 
form, two forms can be constructed which are very much alike. If, in 
constructing either an achievement test or a predictor, explicit sampling 
procedures have been used (test items have been sampled from a large 
collection of relevant items), an equivalent form can be obtained by the 
random selection of new items. However, in the construction of most
tests, the sampling is not so exact: the items are simply collected across 
a broad range of the hypothetical universe of content without an explicit 
sampling. Then, to obtain an equivalent form, the test constructor must 
try to make up a test that “looks” as much like the first form as possible. 
Sometimes this is a fairly easy task, for example, where the test contains 
spelling items, vocabulary items, or arithmetic items. But in some of the
other tests of ability, and in many of the personality tests, it is difficult 
to compose equivalent forms. 

When the reliability coefficient is obtained from equivalent forms, it 
will manifest more of the sources of measurement error, and more accu-
rately, than any other method. It will contain all of the sources of meas-
urement error found in the retest method and, in addition, will give an
indication of the amount of error due to the sampling of content. Although 
the memory of one test form may give the individual a slight advantage 
on taking the equivalent form, the effect of memory on the equivalent 
form reliability coefficient is slight. Were it not for the expense and the 
difficulties of making up equivalent forms, this method of reliability esti-
mation would almost always be preferable to the retest method.

Subdivided-test Method. Instead of making up equivalent forms, a 
compromise procedure has been to obtain “part scores” for different sec- 
tions within the same test. The most popular of such procedures, referred 
to as the “split-half” method, is to give the individual one score on all of 
the even-numbered questions in the examination and another score on 
the odd-numbered items. The two halves of the same test can then be 
correlated to obtain an estimate of the reliability coefficient. A statistical 
correction then must be made to estimate the reliability of the whole test, 
not just of the half-tests (the correction is given and discussed in Appen- 
dix 5). 
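
The split-half procedure can be sketched in a few lines of Python. The item responses below are invented for illustration, and the step-up correction shown is the Spearman-Brown formula for a test twice as long as each half, which is assumed here to be the correction referred to above; the function names are likewise hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(responses):
    """responses: one list of 0/1 item scores per person."""
    odd = [sum(person[0::2]) for person in responses]   # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # Spearman-Brown step-up from half-test to whole-test reliability
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full

# Hypothetical answers of six persons to an eight-item test (1 = correct)
responses = [
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(responses)
print(f"half-test r = {r_half:.2f}, corrected r = {r_full:.2f}")
```

Whenever the half-test correlation lies between zero and one, the corrected value is the larger of the two, which is why the uncorrected figure understates the reliability of the whole test.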

Practical considerations have caused the subdivided-test method to be 
used as extensively as it has. In many situations it is either too expensive 
to compose equivalent forms, or it is not possible to obtain the same sub- 
jects for a second test administration. However, there are some serious 
flaws in the use of the subdivided-test method. Although the method 
determines some of the error due to the sampling of content, it does not 
do this as well as the equivalent-form method. The subdivided test shows 
none of the error due to instability over time, because both halves of the 
test are given at the same time. For these reasons, the subdivided-test 
method usually gives an overestimate of the reliability. 

It is particularly misleading to use the subdivided-test method on a 
highly “speeded” test (see 1, pp. 110-114). Such a test would be, for 
example, one in which the subjects are asked to complete as many simple 
arithmetic problems as possible in a short period of time. If the problems 
are very simple, it is a test of how fast the individual can work. Then each 
person is likely to get nearly all of the items correct as far as he goes, and 
individuals will differ mainly in terms of how far along in the items they 
are when time is called. This works to make the individual’s scores very 
much alike on the split-halves. The odd and even scores will be alike on 
most of the items up to the point at which time is called, because most of 
them will be correct. Also, the odd and even items will be alike beyond 
that point, because they would all be incorrect. The split-half reliability 
estimate obtained on a purely speeded test, such as the one illustrated 
here, would be completely meaningless. Fortunately, there are very few 
tests which are purely concerned with speed. Most tests have some time 
limit simply as a practical way of getting the test completed, but time, as 
such, might be only a trivial influence on the scores. However, we should 
always be wary of split-half reliability estimates obtained on highly 
speeded tests. 
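
The argument about speeded tests can be checked with a small made-up example. In this sketch every person answers correctly each item he reaches and misses every item beyond that point; the number of items and the stopping points are assumptions chosen only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

N_ITEMS = 40
# Hypothetical number of items each person reaches before time is called
items_reached = [12, 17, 21, 22, 26, 29, 33, 38]

odd_scores, even_scores = [], []
for reached in items_reached:
    # Every attempted item is right, every unreached item is wrong
    answers = [1 if i < reached else 0 for i in range(N_ITEMS)]
    odd_scores.append(sum(answers[0::2]))
    even_scores.append(sum(answers[1::2]))

r_half = pearson(odd_scores, even_scores)
print(f"split-half r on a pure speed test = {r_half:.3f}")
```

The odd and even scores can never differ by more than one point for any person, so the correlation is forced close to 1.00 no matter how unreliable the test might actually be.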

Item-interrelationship Methods. The individual who works in the field 
of testing will encounter special methods of reliability estimation based 
on homogeneity, or amount of correlation among the item responses
within one test (see 2, pp. 385-389). Because these methods work within 
the items on one test, they provide no indication of the fluctuation in 
scores over time. What these formulas do is to provide a conservative 
estimate of the subdivided-test type of reliability. The item-interrelation-
ship formulas are the preferred procedures when the retest method is not
advisable and when equivalent forms are not available. 
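
The text does not name the formulas it has in mind. One widely used formula of this kind is Kuder-Richardson formula 20, and the sketch below assumes that formula purely for illustration; the function name and the data are hypothetical.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for 0/1 item scores.

    responses: one list of item scores per person, all of equal length.
    """
    n_people = len(responses)
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    # Variance of total scores (divisor n, to match p * q for the items)
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    pq_sum = 0.0
    for item in range(k):
        p = sum(person[item] for person in responses) / n_people
        pq_sum += p * (1 - p)  # variance of a single 0/1 item
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Hypothetical answers of five persons to a four-item test (1 = correct)
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"KR-20 = {kr20(data):.2f}")
```

The estimate rises as the items correlate more highly with one another, because the variance of total scores then grows relative to the sum of the item variances.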

Measurement of Stability. Most psychologists prefer to consider the 
measurement of stability of scores over time as a separate problem from 
the measurement of the other sources of error variance. As was said pre- 
viously, stability is expected of some instruments and not of others. Also, 
whether or not it is necessary for test scores to remain stable depends on
the way in which the test is used. If an instrument is used over a number 
of years to make predictions, then stability of scores is essential. 

It should be obvious that stability can be measured only by the retest 
and the equivalent-form techniques and not by the subdivided-test and 
item-interrelationship techniques. Stability can be measured either by
retesting or, preferably, by administering equivalent forms over the range 
of time in which the test is used to make predictions. The measurement of 
stability is of more theoretical interest than practical importance. It is 
usually unsafe to use test scores obtained at one time to make predictions 
or counsel individuals at a much later time. It is better to give tests at the 
time when decisions must be made about individuals. The measurement 
of stability is primarily important in helping us understand how human 
attributes develop and change. 

In using either the retest or the equivalent-form techniques for meas- 
uring reliability, it is customary to space the test administrations several 
days to a month or more apart. This is done not so much to measure 
stability over time but to measure “errors in the individual.” It is reason- 
able to expect day-to-day fluctuations in moods and physical conditions 
to influence test scores. If the two test administrations occurred on the 
same day or only several days apart, it would provide an insufficient basis 
for measuring the errors due to fluctuating moods and physical condi- 
tions. Also, if the time span between administrations is very short, indi- 
viduals will remember the first test and will tend to repeat their work 
habits, guessing behavior, and characteristic mistakes on the second test. 


THE HANDLING OF MEASUREMENT ERROR 
IN A TESTING PROGRAM 


Because of the tendency for errors of measurement to lower test validity 
regardless of the sense in which validity is determined, unreliability is 
“bad” in any measurement situation. No definite rule can be stated as to
how high the reliability coefficient should be for a test, but in general one
suspects a test that has a coefficient less than .80. Some of the better-
standardized instruments have reliability coefficients over .90.

It must be remembered that measurement error is important only
because it attenuates test validity. If a predictor test has a high correlation
with its criterion, reliability is no problem. The test constructor is con-
cerned with measurement error when a test fails to predict its criterion.
It is sometimes found that the low predictive power of the test is due to
unreliability from one or more of the error sources described previously.
By removing some of the sources of measurement error, the predictive 
power of the test can be increased. 

Increased Standardization. Rather than worry about unreliability after 
it has occurred, it is primarily important to remove as many sources of 
measurement error as possible when the test is being constructed. Any- 
thing that works to standardize the test instructions and procedures of 
administration, prevents errors in scoring, equates testing environments,
and lowers the influence of guessing, will make a test more reliable. 
Errors in the sampling of content can be lowered to some extent by 
either an actual sampling of items or a careful effort to range the items 
across a defined universe of materials. However, even with the best sam-
pling of items, there will be some sampling error due to the limited num-
ber of items in any test. 

It is often erroneously assumed that the need for test standardization 
requires that tests should be given under the most comfortable conditions 
possible. The requirement is that all subjects should be treated in the 
same way. A particular test may be most valid if the subjects are told 
little about the instrument or if they are purposely confused. It may also 
promote test validity if the subjects are purposely distracted and harassed 
while the test is being administered. 

Increasing the Number of Items. After all efforts have been made to 
increase test standardization, the reliability can be raised by making the 
test longer. The increased length will act to reduce the errors due to 
“guessing,” the errors due to the sampling of content, and the errors in 
the individual. If we make several reasonable assumptions about the 
effect of lengthening a test, it can be predicted how much the reliability 
will be raised by increasing the number of items (the formula is pre- 
sented and explained in Appendix 5). 
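
The effect of lengthening a test can be sketched with the general Spearman-Brown formula, which is assumed here to be the formula referred to above. The sketch takes for granted that the added items are of the same kind and quality as the original ones; the function names are hypothetical.

```python
def lengthened_reliability(r, k):
    """Predicted reliability when a test with reliability r is made k times as long."""
    return k * r / (1 + (k - 1) * r)

def length_factor_needed(r, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - r) / (r * (1 - target))

# A test with reliability .60 that is doubled in length
print(f"{lengthened_reliability(0.60, 2):.2f}")

# Lengthening needed to raise the same test to .80
print(f"{length_factor_needed(0.60, 0.80):.2f}")
```

The second function is simply the first solved for k. It makes plain that each further gain in reliability costs proportionally more items than the one before.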

Residual Errors. In spite of the tester’s best efforts, there will remain 
some measurement error, and there will be considerably more in some 
tests than in others. In some of the personality inventories where the 
individual is asked vague questions, the scores will inevitably be unre- 
liable. Similarly, nothing can be done to remove errors due to instability 
over time, which are errors only in the sense that they hinder predictions.
If the trait being measured fluctuates over short periods of time, the test 
will be of little use in predicting behavior. For example, it is relatively 
meaningless to give interest tests to grammar school students if the pur-
pose is to use the results in planning future training. The things that chil-
dren say they will want to study and work at professionally undergo 
marked changes in relatively short periods of time. 

Interpretation of Reliability Coefficients. All commercial tests should 
report complete reliability data (see 4). If the test manual either gives no 
reliability information or states a reliability coefficient without saying how 
it was determined, the test should be viewed with suspicion. In particular, 
the manual should tell the standard deviation of scores for the group on 
which the reliability estimate was made. The reliability estimate is a cor-
relation coefficient, and like all correlations, its size is directly dependent
on the dispersion, or standard deviation, of scores. The reliability coeffi- 
cient is directly meaningful only when the standard deviation of test 
scores is the same as the standard deviation for the group with which the 
test will be used. It was said in Chapter 5 that the standard error of esti- 
mate tends to remain unchanged with changes in test dispersion, but that 
the correlation coefficient tends to change with the dispersion. Similarly, 
the standard error of measurement tends to remain the same with changes 
in the dispersion of test scores, and the reliability coefficient becomes
larger as more diverse groups are studied. 

If a test is being constructed to predict success for college applicants, 
and a reliability study is being made of the test, the group used in the 
reliability study should have a standard deviation of test scores which is 
very nearly the same as that of the total group of applicants. If the reli-
ability study is made on a group of freshmen, this will not include the
individuals who were refused admission, and consequently, the standard 
deviation of scores in the reliability study will be too small. The reliability 
coefficient will then be an underestimate. If the standard deviations are 
known both for the whole group and the subgroup which is used in the 
reliability study, statistical corrections can be made (see 3, pp. 96-99; 
2, p. 392). 
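
A correction of this kind can be sketched from the assumption stated above, that the standard error of measurement stays the same when the dispersion of scores changes. The standard deviations and the reliability in the example are invented, the function names are hypothetical, and the corrections in the cited references may differ in detail.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement for a group with the given spread."""
    return sd * sqrt(1 - reliability)

def reliability_for_new_spread(r_old, sd_old, sd_new):
    """Reliability expected in a group with a different standard deviation,
    assuming the error variance is the same in both groups."""
    error_variance = sd_old ** 2 * (1 - r_old)
    return 1 - error_variance / sd_new ** 2

# Hypothetical figures: reliability of .80 found on admitted freshmen (SD = 8),
# to be applied to the whole group of applicants (SD = 12)
r_freshmen, sd_freshmen, sd_applicants = 0.80, 8.0, 12.0
r_applicants = reliability_for_new_spread(r_freshmen, sd_freshmen, sd_applicants)
print(f"estimated reliability among applicants = {r_applicants:.2f}")
```

The corrected coefficient is higher than the one found on freshmen, in line with the statement that the restricted group gives an underestimate.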

There have been cases where the erroneously high reliability coeffi- 
cients which have been reported in test manuals were obtained by com- 
puting the coefficients on groups of subjects more diverse than those in 
which the tests are actually to be used. This has been done most frequently 
in determining the reliability of raw scores or mental ages on intelli- 
gence tests by using groups of children ranging in age from, say, nine 
through twelve years. This makes the standard deviation of scores much 
larger than if a representative sample is obtained at one age level only. 

If a test is to be used consistently, and especially if it is a commercially 
published instrument, thorough reliability studies should be made. The
results should then be fully reported, including (1) the method used to 
estimate the reliability coefficient, (2) the standard deviations of scores,
(3) the standard error of measurement, and (4) a description of the
sample of persons used in the reliability study. 
CHAPTER 7


Multivariate Prediction 


The use of one predictor test to forecast an assessment was discussed in 
Chapter 5. In many testing programs there will be several predictor tests 
used, and there are often several assessments to which the predictors are 
applied. A typical situation is the selection of students for graduate 
school. A battery of tests might be employed, including, say, vocabu- 
lary, mathematical reasoning, mechanical aptitude, knowledge of current 
events, and a personality test of introversion. The tests could be used to
select students for different graduate schools, such as engineering, psy-
chology, mathematics, journalism, physics, and others. Multivariate pre-
diction problems are much more complicated than the relatively simple 
strategy of using only one predictor test. This chapter will discuss some 
of the general principles and techniques for dealing with multiple 
measurements. 

Multiple Correlation. First we will discuss the situation in which sev- 
eral tests are used to predict one assessment. It is generally the case that 
scores from two or more tests can be combined in such a way as to 
improve the predictions that can be made from one test alone. Much the 
same is done in everyday life when we gather different kinds of informa- 
tion before making decisions. Suppose, for example, that you are thinking 
of buying a certain make of automobile. For one source of information 
you might talk with a person who owns one of the automobiles to get his 
reactions. Not being satisfied with the one source of information, you 
might discuss the automobile with a mechanic, take a trial drive, and 
consult the reports of consumers’ magazines. If all of these sources of 
information gave favorable indications about the automobile, you would
feel much more confident in making the purchase. In this instance you
would be using a number of sources of information to predict future satis-
faction with the particular make of automobile. In like fashion, drawing
information from a variety of tests enables us to make a better prediction
of the performance of individuals.

In order to discuss the use of a battery of tests to predict an assessment,
it will be helpful to return to the concept of “variance explained” as it
was mentioned in Chapter 5. There it was said that the squared correla-
tion coefficient indicates the per cent of the assessment variance which
can be “explained” by a predictor test. Partitioning variance in this way
has its most direct meaning in the mathematical statistics of prediction,
and it is sometimes difficult to see its common-sense meaning. In Chapter
5 the per cent of variance not explained was stated as


$$\text{Per cent of variance not explained} = \frac{\sigma_{est}^2}{\sigma_y^2}$$


(More properly, it is a variance ratio, a fraction rather than a per cent; 
but we customarily read ratios of variances without the decimal point, 
or, in other words, as per cents.) By subtracting the per cent of variance 
not explained from 1.00, the per cent of variance explained is found: 


$$\text{Per cent of variance explained} = 1 - \frac{\sigma_{est}^2}{\sigma_y^2} = r^2$$


Another way of looking at the problem is that there are individual differ-
ences in assessment scores and that the size of these individual differences
is indicated by the variance. The strategy is to use the individual differ- 
ences on the predictor test to predict individual differences, or the 
variance of individual differences, in assessment scores. The larger the 
per cent of variance explained, $r^2$, the more effective the test is in pre-
dicting individual differences on the assessment.

One way to understand the prediction problem is to think in terms of 
geometric designs showing the amount of overlap of predictor tests with 
the assessment. Here we will use squares to represent the variances
of the predictors and the assessment. Imagine that our purpose is
to predict college grade averages, and the first test that we employ, a
vocabulary test, correlates .50 with grade averages. This means that 25
per cent of the assessment variance is explained by one predictor,
vocabulary. The overlapping variance can be illustrated as in Figure 7-1.
The vocabulary square has been placed over 25 per cent of the area of the
grade average square. The same picture would be obtained by laying one
cardboard square over 25 per cent of the area of another cardboard square.

Figure 7-1. Variance diagram showing one test which correlates .50 with an
assessment. (Squares labeled vocabulary and grade averages.)

Variance diagrams like that in Figure 7-1 begin to show their useful- 
ness when we consider the application of more than one predictor test. 
Suppose that in addition to the vocabulary test, an introversion test is
given to the college students. It correlates .50 with grade averages and 
zero with vocabulary. In other words, introversion explains another 25 
per cent of the grade average variance. A diagram showing the overlap 
of both vocabulary and introversion on grade averages is shown in 
Figure 7-2. The vocabulary and introversion tests together explain 50 per 
cent of the variance of grade averages. 


Figure 7-2. Variance diagram showing two tests which each correlate .50 with an
assessment. The correlation between the two tests is .00. (Squares labeled
vocabulary test, grade averages, and introversion test.)


The combined variance explained by a number of predictor tests can 
be used to form a correlation coefficient. It is called the multiple correla-
tion, and it will be labeled with a capital R instead of with the lower
case r that is used for the bivariate, or “simple” correlation. The symbol
for multiple correlation in its complete form is $R_{y.123 \cdots p}$. The symbol
means the correlation of all the tests 1, 2, 3, up to p, with the assessment
Y. The multiple correlation for the problem in Figure 7-2 would be sym-
bolized as $R_{y.12}$. In Figure 7-2 the 50 per cent of the variance explained
means that there is a multiple correlation of approximately .71, obtained
by finding the square root of .50.

If a third test is found which correlates zero with both vocabulary and 
introversion, the variance of the assessment which it explains can be 
added to that explained by the other two. Imagine that a mathematics 
test is used which correlates zero with both vocabulary and introversion 
and .30 with grade averages. The three independent portions of the
assessment variance would then sum to .59 (.25 + .25 + .09). The multiple
correlation of vocabulary, introversion, and mathematics with grade
averages is then .77, the square root of .59.

If we could continue to find new tests which correlate zero with the 
tests already being used and which correlate to some extent with the
assessment, the per cents of variance explained would accumulate until 
the assessment is predicted perfectly. In this unusual circumstance in 
which the tests all correlate zero among themselves, the multiple correla-
tion can be obtained very easily. The squared multiple correlation equals
the sum of the squared correlations of the predictors with the assessment. 
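
For this special case of tests that correlate zero among themselves, the rule amounts to a one-line computation. The sketch below uses the correlations from the examples in the text; the function name is hypothetical.

```python
from math import sqrt

def multiple_r_uncorrelated(validities):
    """Multiple correlation when the predictors correlate zero with one another.

    validities: correlation of each predictor test with the assessment.
    """
    return sqrt(sum(r * r for r in validities))

# Vocabulary and introversion, each correlating .50 with grade averages
print(f"{multiple_r_uncorrelated([0.50, 0.50]):.2f}")

# Adding a mathematics test that correlates .30 with grade averages
print(f"{multiple_r_uncorrelated([0.50, 0.50, 0.30]):.2f}")
```

Four uncorrelated tests that each correlate .50 with the assessment would account for all of its variance, which is the sense in which the per cents of variance accumulate until the assessment is predicted perfectly.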

Correlated Tests. The strategy of employing a battery of tests to pre- 
dict an assessment is more complicated than the description given so far. 
If correlations among predictor tests were in fact zero, the multiple corre- 
lation would be easy to compute and easy to understand. However, 
correlations among tests are almost never zero, and in most cases, the 
correlations are substantial. Then the portions of variance explained by 
the different predictor tests are no longer independent and can no longer 
be summed to obtain the squared multiple correlation. Going back to 
the illustration in Figure 7-2, imagine that vocabulary and introversion 
each correlates .50 with grade averages but, now, imagine that the corre- 
lation between vocabulary and introversion is .50 instead of zero. The 
overlapping variances are shown in Figure 7-3. In Figure 7-3 each vari-
able overlaps 25 per cent of the areas of the other two variables. Vocabu-
lary covers 25 per cent of the area of grade averages and 25 per cent of
the area of introversion. Similarly, introversion covers 25 per cent of grade
average area and 25 per cent of the vocabulary area. With the addition
of the correlation between vocabulary and introversion, Figure 7-3 can-
not be used to obtain the multiple correlation directly. However, the
diagram will prove useful in showing how a solution is obtained.

Figure 7-3. Variance diagram showing two tests which correlate .50 with an
assessment and in which the correlation between the two tests is also .50.
(Areas labeled grade averages, introversion, and vocabulary.)

In Figure 7-3 there are three areas in which grade average is over- 
lapped by at least one of the predictors. The area with vertical lines is a 
part of the assessment that is explained by vocabulary alone. The area 
with horizontal lines is a portion of the assessment that is explained by 
introversion alone. The dotted area is a portion of the variance of the
assessment that is explained in common by vocabulary and introversion.
In Figure 7-2 where there was no overlap between vocabulary and
introversion, the areas of overlap with grade averages could be added
to obtain a squared multiple correlation of .50. Because the predictors
do overlap in Figure 7-3, it would not be correct to add the per cents of
variance each has in common with grades. Instead, the correct estimate
of the variance of grades explained by the two predictors must be some-
what less than the 50 per cent that was obtained from Figure 7-2. To the
extent that the tests correlate among themselves, the squared multiple 
correlation of the tests with the assessment will be different from the 
sum of the squared simple correlations of the tests with the assessment. 
First we will look at some examples of multiple correlations obtained 
where the correlations among predictor tests are not zero. Then, we will
[...] The symbol $r_{y1}$ will stand for the correlation of the assessment with
predictor test 1. The symbol $r_{12}$ will stand for the correlation between
tests 1 and 2. Considering two tests each of which correlates .50 with an
assessment, the multiple correlation for varying degrees of relationship 
between the tests can be seen as follows: 


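(1) $r_{y1} = .50$, $r_{y2} = .50$, $r_{12} = .00$; $R = .71$

(2) $r_{y1} = .50$, $r_{y2} = .50$, $r_{12} = .30$; $R = .62$

(3) $r_{y1} = .50$, $r_{y2} = .50$, $r_{12} = .50$; $R = .58$

(4) $r_{y1} = .50$, $r_{y2} = .50$, $r_{12} = .70$; $R = .54$

(5) $r_{y1} = .50$, $r_{y2} = .50$, $r_{12} = .80$; $R = .53$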


In the examples above, it can be seen that the multiple correlation
dwindles as the correlation between the tests grows larger. When the
correlation between the two predictor tests is .30, the multiple correlation
drops to .62. When the correlation between the tests reaches .70, the
multiple correlation goes down to .54. From that point on the multiple 
correlation drops off very slowly, going down only 1 point as the 
correlation between the two predictor tests goes from .70 to .80. 
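
The five examples can be reproduced with the standard formula for the multiple correlation of two predictors with one assessment. It is assumed here that this is the formula that produced the values in the text; it does agree with them to two decimal places. The function name is hypothetical.

```python
from math import sqrt

def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of two predictor tests with an assessment.

    r_y1, r_y2: correlations of tests 1 and 2 with the assessment.
    r_12: correlation between the two tests (must be less than 1.00 in size).
    """
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)

for r_12 in (0.00, 0.30, 0.50, 0.70, 0.80):
    big_r = multiple_r_two_predictors(0.50, 0.50, r_12)
    print(f"r12 = {r_12:.2f}   R = {big_r:.2f}")
```

When both tests correlate .50 with the assessment, the squared multiple correlation reduces to .50 divided by one plus the correlation between the tests, which shows directly why R shrinks as the tests overlap more.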

The multiple correlation would reach .50 only when the correlation 
between the two predictors became 1.00. A perfect correlation between 
the two tests could be pictured by placing the variance square for vocab- 
ulary completely over the variance square for introversion. Then they 
would both necessarily overlap the grade average area in exactly the 
same place and the same amount. Consequently the multiple correlation 
would have to be the same as the simple correlation of either predictor 
with grade averages. 

The multiple correlation will always be at least as large as the largest 
correlation between any of the predictor tests and the assessment. That 
is, if one of the tests correlates .50 with the assessment, no matter what 
other correlations are found, the multiple correlation cannot be smaller 
than .50. This is another way of saying that when using multiple correla-
tion the addition of tests to a battery cannot lower the predictive
efficiency that is already present. 

Multiple Regression. Now that some background has been obtained 
on multiple correlation, the technical aspects of the problem can be dis- 
cussed. The use of the correlation coefficient was discussed in Chapter 5 
in respect to the regression equation, or the equation for the best-fit line. 
The standard score form of the regression equation was given as 


$$z'_y = r z_x$$


When there is more than one variable to be used in predicting an assess- 
ment, the regression equation can be extended as follows: 


$$z'_y = \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_3 + \cdots + \beta_p z_p \qquad (7\text{-}1)$$

where $z'_y$ is an estimated standard score on an assessment (college
grades)
$z_1$ is the standard score actually obtained on the first predictor
(say, vocabulary)
$\beta_1$ is the weight for the first predictor


What the above equation says is that, for any p number of tests, there 
are weights by which to multiply the standard scores of the predictors 
such that the composite scores will give the best estimate of the assess-
ment. The least-squares criterion is used as the standard for a “best esti-
mate.” The problem is then to find the $\beta$’s, called regression weights, or
only two predictors: 


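$$\beta_1 = \frac{r_{y1} - r_{y2} r_{12}}{1 - r_{12}^2} \qquad \beta_2 = \frac{r_{y2} - r_{y1} r_{12}}{1 - r_{12}^2} \qquad (7\text{-}2)$$

The formulas can be illustrated with the following correlations:

$$r_{y1} = .50 \qquad r_{y2} = .60 \qquad r_{12} = .30$$

$$\beta_1 = \frac{.50 - (.60 \times .30)}{1 - (.30)^2} = .352$$

$$\beta_2 = \frac{.60 - (.50 \times .30)}{1 - (.30)^2} = .494$$

The obtained beta weights can be substituted into the multiple regression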
equation to predict assessment scores. For an individual with standard 


scores of 1.00 and .50 on tests 1 and 2 respectively, the prediction would 
be as follows: 


$$z'_y = .352(1.00) + .494(.50)$$

$$z'_y = .60$$


The prediction is that this person will make a standard score on the 
assessment of .60, 
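
The two-predictor computation can be sketched in a few lines of Python. This is only an illustration of Formulas (7-1) and (7-2); the function names are hypothetical and are not part of the text.

```python
def beta_weights_two_predictors(r_y1, r_y2, r_12):
    """Return (beta_1, beta_2) for two predictors, following Formula (7-2)."""
    denom = 1.0 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


def predict_standard_score(betas, z_scores):
    """Formula (7-1): weighted sum of the predictor standard scores."""
    return sum(b * z for b, z in zip(betas, z_scores))


# Correlations from the worked example: r_y1 = .50, r_y2 = .60, r_12 = .30.
b1, b2 = beta_weights_two_predictors(r_y1=0.50, r_y2=0.60, r_12=0.30)

# Person with standard scores of 1.00 and .50 on tests 1 and 2.
z_hat = predict_standard_score((b1, b2), (1.00, 0.50))
print(b1, b2, z_hat)  # roughly .35, .49, and .60
```

Small differences in the third decimal place between the program and a hand computation come from rounding the intermediate values.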


Beta weights can be obtained for any number of predictor tests.
However, the procedure for obtaining the weights grows increasingly
laborious after more than two predictors are used. The procedures for
obtaining beta weights for more than two predictors can be found in
most advanced texts on statistics (see 2, 4, 7).
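
For readers with access to a computer, the general case can be sketched as follows. The sketch rests on a standard result not derived in this chapter: the beta weights satisfy the simultaneous equations $r_{yj} = \sum_k \beta_k r_{jk}$, one equation per predictor. The code solves them by ordinary Gaussian elimination; the function names and the three-predictor correlations are hypothetical, chosen only for illustration.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's lists are left untouched.
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot spot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictor correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def beta_weights(predictor_corrs, validities):
    """Beta weights from predictor intercorrelations and test-assessment correlations."""
    return solve_linear_system(predictor_corrs, validities)


# Hypothetical correlations among three predictor tests (not from the text).
predictor_corrs = [
    [1.00, 0.30, 0.20],
    [0.30, 1.00, 0.40],
    [0.20, 0.40, 1.00],
]
validities = [0.50, 0.60, 0.40]  # hypothetical correlations with the assessment
betas = beta_weights(predictor_corrs, validities)
print([round(b, 3) for b in betas])
```

With only two predictors the same routine reproduces the weights given by Formula (7-2).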


After the beta weights are obtained, they in turn can be used to
determine the multiple correlation:

$$ R^2 = \beta_1 r_{y1} + \beta_2 r_{y2} + \cdots + \beta_p r_{yp} \tag{7-3} $$

The beta weights and correlations from the example above illustrate the
computations:

$$ R^2 = .352(.50) + .494(.60) = .472 $$
$$ R = .69 $$

What the multiple regression equation does is to transform a multi-
variate comparison to a bivariate comparison. That is, at the start of the
problem each person has a number of predictor test scores. The multiple
regression equation combines the separate predictor scores into one score
($z'_y$). The question is then, “How well do the $z'_y$ scores correlate with
the actual assessment scores ($z_y$)?” The answer is given by the coefficient
of multiple correlation, Formula (7-3).
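
As a check on the arithmetic, Formula (7-3) can be written as a short Python function. The function name is an illustrative assumption; the weights and correlations are those of the two-predictor example.

```python
import math


def multiple_r_squared(betas, validities):
    """Formula (7-3): sum of each beta weight times its test-assessment correlation."""
    return sum(b * r for b, r in zip(betas, validities))


r_squared = multiple_r_squared([0.352, 0.494], [0.50, 0.60])
multiple_r = math.sqrt(r_squared)
print(round(r_squared, 3), round(multiple_r, 2))
```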

The Standard Error of Estimate in Multiple Regression. In Chapter 5 
the standard error of estimate was described in the bivariate prediction 
problem. A similar measure can be derived from a multiple correlation: 


$$ \sigma_{est} = \sigma_y \sqrt{1 - R^2} \tag{7-4} $$


The only difference between the formula above and that used in the 
bivariate regression problem is that the multiple correlation, R, is substi- 
tuted for the simple correlation, r. The standard error of estimate obtained 
with the multiple correlation has the same properties as that obtained 
in the bivariate problem. It can be used to set confidence intervals for 
predicted scores. The standard error of estimate with a multiple
correlation of .69 would be:

$$ \sigma_{est} = \sigma_y \sqrt{1 - (.69)^2} = .72 \sigma_y $$

If predictions are being made of standard scores on the assessment, $\sigma_y$
equals 1.00; and consequently, the standard error of estimate is .72. If we
were predicting either raw or deviation scores whose standard deviation
is 2.0, the standard error of estimate would be 1.44.
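
The following sketch applies Formula (7-4) and then sets a band around a predicted score. The band uses 1.96 standard errors, which assumes that errors of prediction are normally distributed; that assumption, like the function names, is supplied here for illustration only.

```python
import math


def standard_error_of_estimate(sd_y, multiple_r):
    """Formula (7-4): sd_y * sqrt(1 - R^2)."""
    return sd_y * math.sqrt(1.0 - multiple_r ** 2)


def prediction_band(predicted, std_error, z_crit=1.96):
    """Predicted score plus and minus z_crit standard errors (normality assumed)."""
    return predicted - z_crit * std_error, predicted + z_crit * std_error


se_standard = standard_error_of_estimate(1.0, 0.69)  # standard scores, about .72
se_raw = standard_error_of_estimate(2.0, 0.69)       # twice the value above
low, high = prediction_band(0.60, se_standard)
print(round(se_standard, 2), round(low, 2), round(high, 2))
```

The width of the band shows how much uncertainty remains in an individual prediction even when the multiple correlation is as high as .69.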

Sampling Error of the Multiple Correlation. In Chapter 5 we discussed 
the sampling error problems in using simple correlations. The two most 
important considerations are: (1) determining whether a particular 
correlation is significantly different from zero correlation, and (2) deter- 
mining whether two correlations are significantly different from each 
other. These two problems also occur in the use of multiple correlation. 
The first type of problem can be illustrated with two tests that have no 
predictive validity for a particular assessment: the population values of 
the correlations between the tests and the assessment are zero. In any 
particular sample of persons from the population, the obtained correla- 
tions would probably not be exactly zero. More to be expected is that 
the correlations obtained from successive samples would range about 
zero, with a dispersion dependent on the sample size. 

If the multiple correlation formula is applied to the chance coefficients 
obtained in our hypothetical problem, it will make the most of the 
chance relationships. It was stated previously that the multiple
correlation will never be smaller than the largest predictor-assessment
correlation. In nearly all cases, the multiple correlation will be larger than any
of the simple correlations between the tests and the assessment, even if 
by only 1 or 2 points. The multiple correlation tends to “take advantage 
of chance” to obtain a maximized coefficient. For this reason, special 
formulas must be used to determine the significance of multiple correla- 
tions (see 2, p. 399; 4, pp. 276-280). A discussion of these formulas 
would be too technical for inclusion in this book, but they should be 
studied when need arises. 

The second problem, that of determining the significance of the differ- 
ence between two multiple correlations, is a very real, practical problem 
in testing. This chapter began by considering the likelihood that adding 
tests to a battery will improve the prediction of an assessment. Suppose, 
for example, that we are using a vocabulary test and a mathematical 
reasoning test to predict how well students will perform in psychology 
training. We might then want to add new tests to improve the power 
of prediction. For this purpose a personality test might be used. Imagine 
that the multiple correlation with the grades in psychology training is 
.50 before the personality test is added and .58 after the personality test
is added. Before we can consider the practical importance of the ap- 
parent gain in prediction, it must be determined whether or not the 
difference could likely be due to chance. A statistical check should be 
made to provide some assurance that the new test actually improves the 
prediction of the assessment. 

Because of the tendency for multiple regression analysis to “take ad- 
vantage of chance,” the beta weights obtained in a sample will produce 
a higher multiple correlation in that sample than they will when applied 
to a new sample of persons. This is the same as saying that the multiple 
correlation tends to “shrink” when the beta weights are applied to a 
new group of persons. In the example above, the beta weights applied 
to the same sample on which they were derived produced a multiple 
correlation of .69. If the beta weights are now applied to a new group of 
people, the predicted assessment scores will probably not correlate .69 
with actual assessment scores. The odds are that using the beta weights 
obtained from the first sample will lead to a lower multiple correlation in 
the second sample. The shrinkage is small if the first sample is large, say, 
over 300 persons; but if the beta weights are determined originally on 
as few as 50 persons, the shrinkage is often substantial (see 4, pp. 185- 
186). The amount of shrinkage is also related to the number of predictor 
tests being used. The more tests there are, the more opportunity there is 
to “take advantage of chance,” and consequently, the more shrinkage 
to be expected. The proper way to determine the predictive efficiency of 
a set of beta weights is to “cross validate.” That is, the beta weights
should be determined on one sample of persons and then tried out on a
second sample. The correlation between the predicted scores in the 
second sample and the actual assessment scores is the best estimate of 
the predictive efficiency of the battery. 
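
The cross-validation procedure can be sketched with simulated data. Everything in the example is hypothetical: the data are generated at random, the seed and sample sizes are arbitrary, and the helper names are not from the text. The beta weights are derived in one sample of 50 persons and then applied unchanged to a second sample of 50.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Product-moment correlation between two lists of scores."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def draw_sample(rng, n):
    """Hypothetical persons: two correlated tests and one assessment."""
    rows = []
    for _ in range(n):
        common = rng.gauss(0, 1)
        test_1 = common + rng.gauss(0, 1)
        test_2 = common + rng.gauss(0, 1)
        assessment = 0.5 * test_1 + 0.5 * test_2 + rng.gauss(0, 2)
        rows.append((test_1, test_2, assessment))
    return rows


def fit_beta_weights(rows):
    """Two-predictor beta weights, Formula (7-2), from sample correlations."""
    t1, t2, y = zip(*rows)
    r_y1, r_y2, r_12 = pearson_r(t1, y), pearson_r(t2, y), pearson_r(t1, t2)
    denom = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom


def norms_for(rows):
    """Mean and standard deviation of each test, used to form standard scores."""
    out = []
    for column in list(zip(*rows))[:2]:
        m = mean(column)
        s = math.sqrt(sum((v - m) ** 2 for v in column) / len(column))
        out.append((m, s))
    return out


def composite_validity(rows, betas, norms):
    """Correlate the weighted composite with the actual assessment scores."""
    (m1, s1), (m2, s2) = norms
    composites = [betas[0] * (a - m1) / s1 + betas[1] * (b - m2) / s2
                  for a, b, _ in rows]
    return pearson_r(composites, [y for _, _, y in rows])


rng = random.Random(7)             # arbitrary seed so the run can be repeated
derivation = draw_sample(rng, 50)  # sample on which the weights are derived
holdout = draw_sample(rng, 50)     # new sample for the cross validation
betas = fit_beta_weights(derivation)
norms = norms_for(derivation)
r_derivation = composite_validity(derivation, betas, norms)
r_cross = composite_validity(holdout, betas, norms)
print(round(r_derivation, 2), round(r_cross, 2))
```

In the derivation sample the composite correlates with the assessment exactly at the multiple correlation. The second figure is the cross-validated estimate; it will usually, though not in every single run, be the lower of the two.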


SELECTION OF TESTS FOR A BATTERY 


When only one test is to be used in predicting a particular assessment, 
there is little difficulty in deciding how to choose the test. The test that 
correlates the most with the assessment would be chosen. When a battery
of tests is being formed, the matter of choosing tests is not so simple. The 
strategy is to select the fewest tests which will give the largest multiple 
correlation with the assessment. This may make for some apparently 
unusual choices in some situations. To picture one such situation, con- 
sider the following multiple correlation problem for two tests and one 
assessment: 


$$ r_{y1} = .00 \qquad r_{y2} = .60 \qquad r_{12} = .50 \qquad R = .70 $$


The first thing that strikes the eye is that test 1 correlates zero with the
assessment. If this test were being considered separately as a predictor of the
assessment, it would be of no use. However, the multiple correlation 
demonstrates that test 1 can be used to improve the prediction that is 
obtained from test 2 alone. A test which correlates low with the assess- 
ment and high with one of the other predictor tests is called a suppressor 
test. That is, it works to suppress, or cancel out, a part of the variance of 
the second test in such a way as to improve the multivariate prediction.

In the example above the beta weights for tests 1 and 2 would be -.4
and .8 respectively. The regression equation would then be: 


$$ z'_y = -.4 z_1 + .8 z_2 $$


You can see how in this way test 1 works to suppress, or cancel out, the 
invalid variance in test 2. 
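
The suppressor effect can be verified by applying Formulas (7-2) and (7-3) to the three correlations of the example. The variable names below are illustrative only.

```python
import math

# Correlations from the suppressor example.
r_y1, r_y2, r_12 = 0.00, 0.60, 0.50

denom = 1.0 - r_12 ** 2
beta_1 = (r_y1 - r_y2 * r_12) / denom  # negative weight for the suppressor test
beta_2 = (r_y2 - r_y1 * r_12) / denom  # weight for the valid test
multiple_r = math.sqrt(beta_1 * r_y1 + beta_2 * r_y2)

print(round(beta_1, 2), round(beta_2, 2), round(multiple_r, 3))
```

The multiple correlation computed this way comes to roughly .69, against .60 for test 2 used alone, even though test 1 by itself predicts nothing.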

The example of the suppressor test is given to show that it is easy to 
be fooled about how useful a new test will be until a statistical analysis 
is undertaken. The complexity of intercorrelations among predictors and 
the assessment makes it difficult to judge the utility of a particular test 
without first working out the multiple correlation. 

Although the purpose of this chapter is to show the value of multiple 
measurements, it should be said that multiple regression analysis is not
a panacea for the person who works with tests. The usual experience is
that the multiple correlation grows only very slowly as the battery of tests
increases from three, to four, to five, and so on. A substantial increase in 
predictive efficiency is often obtained from adding a second test to an 
initially good predictor, and in many cases a third test will prove valuable.
However, in only a few cases has it been found that using more than 
four or five tests adds significantly to the predictive efficiency of the 
battery. 

The apparent limit to the gain from adding new tests is not an intrinsic 
characteristic of the prediction problem. It is due to the fact that psycho- 
logical tests tend to duplicate each other’s content so much that it is 
difficult to find more than four or five truly different kinds of measures.
There is always the hope and the possibility that a new type of test 
will add to the predictive power of existing test batteries. 

There is usually a practical limit to the size of a test battery. The 
question is whether or not the gain in predictive power obtained from 
using a new test is worth the expense and effort. For example, if a test 
raises the multiple correlation of a battery from .50 to .55, it can be asked 
whether this gain is sufficient ground for including the test in the battery. 
The answer to that would depend on the testing situation. For example,
if the expense of training a person in a particular occupation is large, 
as it would be in most graduate school programs or as it would be in 
training a pilot in the Air Force, then a gain of 5 points in predictive 
power could well justify the use of the additional test. 


DIFFERENTIAL PREDICTION 


Earlier in this chapter it was said that the testing situation may entail 
a number of assessments to be predicted, as was illustrated with the 
selection of students for various graduate training programs. Suppose 
that only one test could be used to select persons for the different gradu- 
ate schools. The test would then have a separate correlation with the 
grades in each of the graduate training programs. It might be found that 
the test is better able to predict the grades of students in psychology 
than in physics. If the test correlates positively with all of the assess- 
ments, even though it correlates higher with some than others, then the 
higher the individual's score on the test, the more likely he is to succeed 
in any of the programs. An over-all index of aptitude for graduate school 
would be of some use but would be little help in making decisions about 
which students should be allowed to enter particular programs. 

Generally we would find a battery of tests, instead of a single test, 
being used to predict a group of assessments. For example, we might use
a vocabulary test, a reading comprehension test, a mathematical reasoning
test, and an interest test in a battery. A multiple correlation could
be obtained between the test battery and each assessment. With each
multiple correlation would be obtained a set of beta weights. There
would be a different regression equation for each assessment to be
predicted, and the beta weights would probably not be the same in the
prediction of the different assessments. For example, the vocabulary test 
might correlate highly with grades in the history department but very 
little with grades in engineering. Then the beta weight for vocabulary 
would be different in the prediction of success in the two programs. A 
prediction could be made as to the program in which an individual would 
most likely succeed by trying out his test scores in the regression equa- 
tions for each assessment. He would then have a predicted score for each 
program. The student could then be advised that, whereas he probably 
would not succeed in certain school programs, he probably would make 
the grade in others. 
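
A small sketch may make the procedure concrete. The beta weights, the program names, and the student's standard scores below are all hypothetical; in practice each set of weights would come from a separate multiple regression analysis.

```python
# Hypothetical beta weights for two programs (not taken from the text).
PROGRAM_WEIGHTS = {
    "history": {"vocabulary": 0.45, "reading": 0.30, "math": 0.05, "interest": 0.10},
    "engineering": {"vocabulary": 0.05, "reading": 0.15, "math": 0.55, "interest": 0.10},
}


def predicted_scores(z_scores, program_weights):
    """Apply each program's regression equation to one person's standard scores."""
    return {
        program: sum(weights[test] * z_scores[test] for test in weights)
        for program, weights in program_weights.items()
    }


# Hypothetical student: strong verbal scores, weak mathematical reasoning.
student = {"vocabulary": 1.2, "reading": 0.8, "math": -0.9, "interest": 0.3}
predictions = predicted_scores(student, PROGRAM_WEIGHTS)
best_program = max(predictions, key=predictions.get)
print(predictions, best_program)
```

The same test scores yield a predicted standard score well above the mean in one program and below the mean in the other, which is the information needed for advising the student.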

In the actual use of test batteries to predict multiple assessments, a 
number of nontest factors must be taken into consideration. If persons 
were being allocated to different training programs in the Armed Forces, 
it would be necessary to consider the time and expense which is involved. 
If the training program is relatively expensive, it would be wise not to
accept an individual unless his predicted performance is very high. It is 
also necessary to consider the number of men needed in different special- 
ties and the number of people who can be trained at any one time. If 
there is a shortage of, say, airplane mechanics, it would be wise to lower 
the standards somewhat by “gambling” on persons whose test scores
forecast only moderate success. In nearly all situations the individual's own
interests and desires are taken into account to at least some extent. 

We need differential prediction because different jobs often require 
different abilities. For example, it may be that high scores in the vocabu- 
lary test and the reading comprehension test are essential for a person 
in journalism, but it might matter little what score the person makes in 
mathematical reasoning. If we find a person who scores high in the first 
two tests, he would be a good bet for the journalism department even 
though he has a very low grade in the mathematical reasoning test. If 
instead of using differential prediction, the student’s test scores are 
averaged to form one general index of ability, the score in mathematical 
reasoning might make the average very low. On the basis of the general 
index of ability, the student might be refused admission to graduate 
school, without account being taken of the fact that his pattern of abil- 
ities makes him suited for one of the programs. Whenever there are a 
number of assessments in the selection situation, the potential advantage 
is with differential prediction rather than with the use of only a single 
test variable.


DISCRIMINATORY ANALYSIS


A handy way to study an individual’s scores on a battery of tests is to 
present the information in the form of a graph, or profile, as shown in 
Figure 7-4. Figure 7-4 shows the standard scores for two persons on five 
tests. There are a number of characteristics of profiles that prove helpful 
in testing (see 1, 5). 

Figure 7-4. Test profiles for two persons.
[Profile graph: standard scores (horizontal axis, -3 to 3) on five tests:
vocabulary, mathematical reasoning, current events, mechanical aptitude,
and extraversion.]


Level. One important feature of a test profile is the level, or the extent 
to which the person does well on the tests as a whole. The level is 
defined as the mean standard score made on the tests by an individual. 
From Figure 7-4 we find that person a has a mean of .78 and person b has 
a mean of -.01. Then it is said that the level is higher for a than for b.

The profile level provides direct information about people only if all 
of the tests correlate positively with the assessment being predicted. If 
all the tests correlate positively with the assessment, high test scores are 
“good” in the sense that they portend successful performance. If the 
profile concerns scores on tests of the ability type only, it is usually the 
case that all tests correlate positively with the assessment, even if the 
correlations are very small in some cases. Therefore, the level is usually 
a meaningful measure when all of the tests concern human abilities. 
Some cautions must be exercised in interpreting the profile level if some 
of the tests measure nonability type functions, such as interests, attitudes,
and social traits. In Figure 7-4 there is one test of the nonability type,
the test for extraversion. Consequently, adding in extraversion scores with
the four tests of the ability type would be meaningful only if extraversion
were important in the particular assessment being predicted.

If all the tests correlate positively with the assessment, and if we have
no information about people except the profile level, we would choose
the people with the highest levels. A high profile level means that the
person scores generally high on all of the tests, even though he may 
score higher on some tests than on others. 

Scatter. Scatter concerns the extent to which an individual does better 
on some of the tests than on others. The standard deviation of the in- 
dividual’s scores on the separate tests serves as an index of scatter. In 
Figure 7-4 each person has five standard scores. The scatter for each
person is computed by substituting his five standard scores in the stand- 
ard deviation formula. Computing these indices for the two profiles in 
Figure 7-4, there is a standard deviation of scores for person a of 1.62 and 
a standard deviation of 1.08 for b. This shows that the scatter for a is 
more than that for b. An approximate measure of scatter can be obtained 
by using the range of the person’s scores. Person a does best on the 
current events test and worst on the mechanical aptitude test. Subtract- 
ing the lowest score from the highest, a range of 4.0 is found for person 
a. In a similar way, a range of scores of 2.8 is found for person b. In 
comparison to other persons with the same profile level, a person with 
a wide scatter is likely to do relatively well on some jobs and relatively 
poorly on others. A person who has a very small amount of scatter is likely 
to do as well, or as poorly, on all of the performances being predicted. 
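
Level and scatter are easily computed once the standard scores are in hand. The profile below is hypothetical (it is not one of the profiles in Figure 7-4), and the sketch assumes that the standard deviation is taken with N, not N - 1, in the denominator.

```python
import statistics

# Hypothetical standard scores for one person on the five tests.
profile = {
    "vocabulary": 1.5,
    "mathematical reasoning": 0.5,
    "current events": 2.0,
    "mechanical aptitude": -1.0,
    "extraversion": 0.0,
}

scores = list(profile.values())
level = statistics.mean(scores)          # mean standard score
scatter = statistics.pstdev(scores)      # standard deviation (N in the denominator)
score_range = max(scores) - min(scores)  # approximate index of scatter

print(round(level, 2), round(scatter, 2), round(score_range, 1))
```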

Shape. There is one more important type of information in the profile, 
which is called the profile shape. The shape concerns the ups and downs 
in ability shown in the profile. The shape is defined as the order of test 
scores for the individual. Person a makes his highest score on the current 
events test, his next highest score on the vocabulary test, and so on to 
his lowest score which is made on the mechanical aptitude test. In addi- 
tion to the differences between persons a and b in terms of level and 
scatter, there are differences in profile shape. Person b makes his highest 
score on the mathematical reasoning test and his lowest score on the 
vocabulary test. The profile shape offers a way of fitting an individual 
to a particular job. In so far as the individual does better on some tests 
than others, he can be placed on a job which capitalizes on his strongest 
points. The use of profiles in this way is an alternative to the multiple 
regression approach. 

The reader should be warned not to take the term shape to mean 
literally anything about the physical appearance of the profile. The shape 
refers strictly to the individual's rank-order of performance on the tests.
A naive individual studying a test profile might mistakenly impute
meaning to the “picture” represented by the profile. If the individual scored
lower on the even-numbered tests than on the odd-numbered tests, this 
would give the profile a jagged appearance. This might mistakenly be 
assumed to represent something about the individual. However, the 
ordering of tests in the profile is arbitrary. The tests can be reordered to 
make the profile “look” quite different. However, this will have no effect 
on shape, as it is defined above, or on level and scatter. 
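
The point that shape depends only on rank order can be shown in a brief sketch. The profile is hypothetical and the function name is an illustrative assumption.

```python
def profile_shape(scores):
    """Return the test names ordered from highest to lowest standard score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical profile, listed in the order the tests happen to be plotted.
profile = {
    "vocabulary": 1.5,
    "mathematical reasoning": 0.5,
    "current events": 2.0,
    "mechanical aptitude": -1.0,
    "extraversion": 0.0,
}

# The same scores with the tests plotted in the reverse order.
reordered = dict(reversed(list(profile.items())))

print(profile_shape(profile))
print(profile_shape(profile) == profile_shape(reordered))
```

Rearranging the tests changes the look of the plotted profile but leaves the rank order of the scores, and therefore the shape, unchanged.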

The Standard Error of Difference Scores. Before practical use can be 
made of profile information, it must be determined whether differences 
in test scores within the profile are statistically significant. For example, 
it can be seen from Figure 7-4 that person a makes a standard score of 
2.5 on the vocabulary test and .5 on the mathematical reasoning test. If 
it can be shown that this is a real difference, a difference not due to the 
unreliability of the tests, the difference would be useful in predicting 
how well person a would do on particular jobs. 

In Chapter 6 we discussed the reliability of separate tests. Difference 
scores also have some measurement error, or unreliability. If we gave
person a the test battery on two separate occasions, we could see if the
difference between the scores on the vocabulary test and the mathemat- 
ical reasoning test would still be 2.0 standard score units. Perhaps the 
second time the difference score would be 1.5 or 2.5. 

The reliability of a set of difference scores¹ can be determined from
the separate reliabilities of the tests as follows:

$$ r_{diff} = \frac{r_{11} + r_{22} - 2r_{12}}{2(1 - r_{12})} \tag{7-5} $$

where $r_{diff}$ is the reliability of differences between standard scores on
        tests 1 and 2
      $r_{11}$ is the reliability of the first test
      $r_{22}$ is the reliability of the second test
      $r_{12}$ is the correlation between the two tests

¹ The formula assumes that differences are obtained between sets of standard scores.

Let the reliability of the first test be .85, the reliability of the second test
be .95, and assume that there is a correlation of .65 between the tests.
These values can be substituted into the above equation:

$$ r_{diff} = \frac{.85 + .95 - 2(.65)}{2(1 - .65)} = .71 $$

You are probably struck by the fact that the reliability of the difference
scores is less than that of either of the separate tests. This will usually
be the case when the correlation between tests is positive. It can also be
seen from inspecting the formula that the reliability of a set of difference
scores decreases as the correlation between the tests becomes larger.
Thus, if in the illustration above the correlation between the two tests
had been .75 instead of .65, the reliability of the difference scores would
drop to .60. It would have worked the other way had the correlation
between tests been negative. In that case, the difference scores would
have been more reliable than the average reliability of the tests.
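
Formula (7-5) can be tried out with several sets of values in a short program. The function name is hypothetical, and the negative correlation of -.30 in the last line is an invented value used only to show the direction of the effect.

```python
def difference_score_reliability(r_11, r_22, r_12):
    """Formula (7-5): reliability of differences between two standard scores."""
    if r_12 >= 1.0:
        # With a perfect correlation every difference is zero; the ratio is undefined.
        raise ValueError("correlation between tests must be below 1.00")
    return (r_11 + r_22 - 2.0 * r_12) / (2.0 * (1.0 - r_12))


worked_example = difference_score_reliability(0.85, 0.95, 0.65)        # about .71
higher_correlation = difference_score_reliability(0.85, 0.95, 0.75)    # about .60
negative_correlation = difference_score_reliability(0.85, 0.95, -0.30)
print(round(worked_example, 2), round(higher_correlation, 2),
      round(negative_correlation, 2))
```

With the negative correlation the reliability of the differences exceeds .90, the average reliability of the two tests.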

The reason for the decrease in reliability of difference scores as the 
correlation between tests grows larger is that as the correlation increases, 
the difference scores become smaller and smaller. If the correlation be- 
tween two tests is 1.00, people have the same standard scores on both 
tests, and consequently, the score differences are all zero. If the
correlation were -1.00, the differences would be as large as possible and,
consequently, as reliable as possible.

It is unsafe to interpret difference scores unless the individual tests are
highly reliable and the correlations among tests are relatively low. In
the extreme case where the correlation between two tests is the same 
as the average of their individual reliabilities, the reliability of difference 
scores is zero. This would happen if, for example, the respective re- 
liabilities are .70 and .80, and the correlation between tests is .75. Fortu- 
nately, such extreme cases are unlikely to occur in the use of tests; but 
it is sometimes the case that correlations among tests are so high and 
individual reliabilities are so low as to make difference scores highly 
unreliable.

Prediction from Test Profiles. Assuming that the cautions regarding the
reliability of difference scores have been heeded, the individual’s test
profile offers a basis for predicting how well he will perform in particular
situations. A useful beginning is to compute a composite profile for each
assessment to be predicted. This would be the profile of mean scores for 
the persons who perform well. For example, if we study the mean test 
scores of successful policemen, it might be found that they make above- 
average scores in vocabulary, high scores in current events, low scores 
in mathematical reasoning, and so on. The policemen could be compared 
with a group of successful accountants, who might be found to make 
high scores in mathematical reasoning, low scores in extraversion, high 
scores in current events, and so on. For each group, the mean test scores 
can be plotted as a profile, a composite profile for each job. In Figure 7-5 
hypothetical profiles have been drawn for the two groups, and the profiles 
for two persons, a and b, have been included to show how the composite 
profiles are used in making predictions. (Persons a and b shown in Fig- 
ure 7-5 are not the same as persons a and b shown in Figure 7-4.) 

A most useful type of information would be the ideal profile for each 
job. This would be found from the mean test scores of excellent police- 
men and excellent accountants. Then, if an individual's test profile closely 
resembles the ideal profile, it is a good bet that he will do relatively well
on the particular job. Although we do not have the ideal profile for each
job, we do have the composite profile of successful men. In other words,
we have the profile of satisfactory policemen but not necessarily that of 
excellent policemen. 

Before people can be classified for different jobs, some way is needed
of determining how similar one test profile is to a composite profile. This
might be done impressionistically by simply looking at the similarities and
differences. Looking at Figure 7-5 it can be seen that the profile for
person a is more similar to the composite profile for accountants than to
the composite profile for policemen. The profile for person b appears
more similar to the profile for policemen than to the profile for
accountants. Although in extreme cases a simple inspection of similarities and
differences would suffice to assign people to jobs, it would not be
satisfactory for less clearcut cases. What is needed is a measure of the
similarity between test profiles.

Figure 7-5. The test profiles of two persons compared with the composite profiles for
policemen and accountants. [Profile graph of standard scores; the plotted values are
tabulated here.]

                           Policemen   Accountants   Person a   Person b
Vocabulary                    1.4          1.9          2.5        0.5
Mathematical reasoning       -0.5          2.4          2.8        0.5
Current events                1.7          2.0          1.2        1.4
Mechanical aptitude       [illegible]     -0.8         -0.5        0.7
Extraversion                  2.2         -1.2         -0.2        1.8

In order to explain a technique for measuring the similarity between
profiles, it will be helpful to give a geometric interpretation of test
scores. In Figure 7-6 the mean scores are shown for policemen and
accountants and for persons a and b on the mathematical reasoning and
extraversion tests only. We could have plotted all of the successful
accountants’ scores on the graph instead of only their mean scores. Then
the accountants would have formed a swarm of dots extending out
around their mean location. The plot of a set of mean scores is referred
to as a group centroid. It is the point about which the scores for the
group balance in all directions. Likewise, the mean scores for the
policemen define a centroid, one that is different from that for accountants. If
each person in the group of accountants had been plotted instead of the
group means, it would have shown the density of the group, or the degree
to which the group is spread out around the centroid. It might have
been found in this way that the accountants are more densely ordered
about their centroid or more homogeneous in respect to test scores.

Figure 7-6. Graphic representation of scores on the extraversion and mathematical
reasoning tests.

Instead of plotting the scores for only two tests, a three-dimensional
model could have been constructed to show the scores for three of the
tests. This could be done with a boxlike structure in which balls are
hung to represent the scores on three tests. In this way we could obtain
a physical representation of the scores made by the two groups and by
persons a and b on the vocabulary, mathematical reasoning, and
extraversion tests. The mean group scores would then locate centroids in three
dimensions. If each of the persons in the group of successful accountants
was plotted instead of only the centroid, there would be a scattering of
the persons about the centroid in all directions. The method that will be
used to compare test profiles and the statistics that will be developed in
respect to the method can be employed with any number of tests. Of
course, when the number of tests is more than three, there would be no
way of constructing a physical model of the scores; but this will in no
way affect the generality of the concepts and mathematical procedures
that apply. The mean test scores for a group of persons on five, six, or
any number of tests can still be thought of as defining a centroid in that
number of dimensions.

The Distance Measure, D. In Figure 7-6 it is apparent that the cen- 
troids are different for accountants and policemen. It would be helpful 
here to have a measure of how far apart they are. The statistic that has 
proved useful for this purpose is called the D measure. It could be used 
quite simply to determine the distance of the two groups from each other 
as follows: 


$$ D_{ap}^2 = (M_{1a} - M_{1p})^2 + (M_{2a} - M_{2p})^2 \tag{7-6} $$

where $D_{ap}$ is the distance between the centroids of accountants, a, and
        policemen, p
      $M_{1a}$ is the mean score of accountants on test 1
      $M_{1p}$ is the mean score of policemen on test 1
      $M_{2a}$ is the mean score of accountants on test 2
      $M_{2p}$ is the mean score of policemen on test 2


What the formula says is that the sum of the squared differences between 
the means equals the square of the distance between the two centroids. 

The D measure is a use of the Pythagorean theorem, which says that for a
right triangle the sum of the squares of the sides is equal to the square of
the hypotenuse. How this works is illustrated in Figure 7-7. The D measure
can be applied [...] the mean for accountants on extraversion. The square
root of this quantity is then the distance of person a from the centroid for
accountants:

$$ D^2 = (2.8 - 2.4)^2 + (-.2 + 1.2)^2 $$
$$ D^2 = (.4)^2 + (1.0)^2 $$
$$ D^2 = .16 + 1.0 = 1.16 $$
$$ D = 1.08 $$

[Figure 7-7: diagram of the two group centroids, labeled Accountants and Policemen.]


The first step in making decisions about people with the D statistic is 
to compute the distance of each person’s score profile from the composite 
profile for each job. The individual would then be assigned to the job 
whose composite profile is closest to his own profile. This would be very 
easy to do in the situation in which there are only two jobs to consider 
and only two tests on which to base the decision. It would only be neces- 
sary to find the distance of a person from the two centroids. Considering 
only the mathematical reasoning and extraversion tests, person a is 4.08 
from the centroid for policemen and 1.08 from the centroid for account- 
ants. 
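
The two-test case can be sketched with the standard library's `math.hypot`, which returns the length of the hypotenuse. The scores are those used in the worked example for the mathematical reasoning and extraversion tests; the function name is illustrative.

```python
import math


def distance_two_tests(person, centroid):
    """Distance between two points, each given as (score on test 1, score on test 2)."""
    return math.hypot(person[0] - centroid[0], person[1] - centroid[1])


# (mathematical reasoning, extraversion)
person_a = (2.8, -0.2)
accountants = (2.4, -1.2)
policemen = (-0.5, 2.2)

d_accountants = distance_two_tests(person_a, accountants)
d_policemen = distance_two_tests(person_a, policemen)
print(round(d_accountants, 2), round(d_policemen, 2))
```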

There will generally be more than two tests in discriminatory analysis. 
The D measure can be used with any number of tests by expanding the 
formula as follows: 


$$ D_{ab}^2 = (z_{1a} - z_{1b})^2 + (z_{2a} - z_{2b})^2 + \cdots + (z_{ka} - z_{kb})^2 \tag{7-7} $$

where $D_{ab}$ is the distance between persons a and b or the distance
        between the centroids of groups a and b
      $z_{1a}$ is the standard score of person a on test 1 or the mean
        standard score of group a on test 1
      $z_{1b}$ is the standard score of person b on test 1 or the mean
        standard score of group b on test 1
      $k$ is the number of tests

The formula says that the distance between two persons, or the distance 
between the centroids of two groups, is determined by squaring the differ- 
ences on the separate tests, summing these, and taking the square root of 
the sum. In this way it is found that over the five tests person a has a dis- 
tance of 1.50 from the centroid for accountants and a distance of 4.59 
from the centroid for policemen. Using this procedure, person a would be 
classified as a potential accountant rather than as a policeman. (Some 
criticisms of this method of classifying people will be given later.)
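
The classification procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented for the example, the scores 2.8 and -.2 are read as person a's standard scores and 2.4 and -1.2 as the accountant means (the worked example suggests this but does not say so outright), and the policeman centroid is a made-up value.

```python
import math


def d_measure(profile_a, profile_b):
    """Distance D between two profiles of standard scores (formula 7-7)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("both profiles must cover the same tests")
    squared = sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))
    return math.sqrt(squared)


def classify(profile, centroids):
    """Return the nearest group and the distance to every centroid."""
    distances = {job: d_measure(profile, c) for job, c in centroids.items()}
    nearest = min(distances, key=distances.get)
    return nearest, distances


# Scores are ordered (mathematical reasoning, extraversion).
# Assumption: 2.8 and -.2 belong to person a; 2.4 and -1.2 are the
# accountant means, as the worked example in the text suggests.
person_a = (2.8, -0.2)
centroids = {
    "accountants": (2.4, -1.2),
    "policemen": (-1.0, 1.0),  # hypothetical values, not from the text
}

job, distances = classify(person_a, centroids)
# The accountant distance rounds to 1.08, matching the worked example.
print(job, {name: round(d, 2) for name, d in distances.items()})
```

The same two functions work unchanged for any number of tests and any number of groups, since the profiles are simply longer tuples and the dictionary of centroids simply has more entries.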

The D measure could be used to discriminate more than two groups.
For example, in addition to policemen and accountants, there might
be composite profiles for successful social workers and for successful
plumbers. For each group there will be a centroid. If the centroids are
different, the individual can be classified in terms of the centroid which
his score combination most nearly approaches. Theoretically it is possible
to classify a person on the basis of any number of tests into one of any
number of groups. In practical problems the number of groups will usually
be limited to a few, and the number of tests will not generally be more
than a half dozen.

The Linear Discriminant Function. Although the D measure provides 
an approximate solution to discriminatory analysis, it has a number of 
shortcomings. If tests correlate with one another, and substantial correla- 
tions are usually expected, the distance between persons on one test will 
tend to be perpetuated on other tests. Therefore, what appears to be a 
large distance between persons, or between groups, may be only a reflec- 
tion of the correlations among tests. For example, if two persons have 
widely different scores in vocabulary, they are also likely to have widely 
different scores in similar tests such as reading comprehension and verbal 
analogies. Even if the two persons had very similar scores in mathematical 
reasoning and extraversion, the differences due to the three verbal com- 
prehension tests would make the D large, which would give a distorted 
picture of the actual differences between the two persons. 

Another drawback to using the D measure as a means of classifying 
people is that it does not fully take account of the degree to which differ- 
ent variables differentiate one job from another. For example, it may 
be that there is a considerable dispersion of mechanical aptitude among 
successful policemen. This means that mechanical aptitude is not an 
important variable in choosing policemen. Then if an individual makes a 
very high or very low mechanical aptitude score, the D statistic will be 
affected by the difference between the individual’s score and the mean 
score for policemen. The more proper approach is to weight each test in
terms of the degree to which it discriminates one job from another.
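
The two drawbacks described above, correlated tests and unequal dispersion, can be made concrete with a small sketch. It uses the Mahalanobis distance, which is a swapped-in illustration and not the linear discriminant function itself: it is one standard way of scaling a distance by the scatter and intercorrelation of the tests. The covariance values and scores below are hypothetical.

```python
import math


def euclidean_d(x, centroid):
    """The unweighted D measure for two tests."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, centroid)))


def mahalanobis_2d(x, centroid, cov):
    """Distance for two tests, scaled by a 2 x 2 within-group covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = x[0] - centroid[0]
    dy = x[1] - centroid[1]
    # Quadratic form diff' * inverse(cov) * diff, written out for the 2 x 2 case.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


centroid = (0.0, 0.0)
person = (2.0, 2.0)  # hypothetical standard scores on two verbal tests

uncorrelated = ((1.0, 0.0), (0.0, 1.0))
correlated = ((1.0, 0.8), (0.8, 1.0))  # hypothetical r = .80 between the tests

d_plain = euclidean_d(person, centroid)
d_uncorrelated = mahalanobis_2d(person, centroid, uncorrelated)
d_correlated = mahalanobis_2d(person, centroid, correlated)
print(round(d_plain, 2), round(d_uncorrelated, 2), round(d_correlated, 2))
```

When the two tests are uncorrelated the scaled distance equals D. When they correlate .80, the second test largely repeats the first, and the scaled distance shrinks accordingly. A large variance on a test (a wide dispersion within the group) shrinks that test's contribution in the same way.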

Another factor that must be considered in discriminatory analysis is 
that of obtaining inferential statistics which will tell the probability that 
persons are correctly classified. The standard error of estimate is the 
analogous statistic used in correlational analysis. These statistics must 
consider the relative density, or scatter, of the members of each group 
about their centroid. In order to overcome the shortcomings of the D
measure, more elaborate procedures must be used. The most widely used
procedure is the linear discriminant function (see 3, 6). The more complex
procedures are modifications of the D statistic rather than basically
different approaches. Whereas the D statistic is a useful approximation, the
more complete procedures should be sought when need arises (see 6).


A COMPARISON OF REGRESSION AND DISCRIMINATORY ANALYSIS


Two different ways have been described for dealing with multiple 
measurements: multiple regression and discriminatory analysis. It has not 
yet been made explicit how the methods differ. The major difference is 
that the multiple-regression methods assume that the ideal profile
touches only the extremes of the test continua. That is, the
multiple-regression methods assume “the more the better” on each test
regardless of the job, or task, being predicted. The reason for this is
that the regression methods assume linear relationships. If there is a
linear relationship between a predictor and an assessment, then the higher
the score made on the predictor, the better the indication that the person
will succeed. This means that the ideal score is the highest that can be
made on
the test. It may be, of course, that the correlation between test and assess- 
ment is negative. This would mean that the lower the score made by the 
individual the better is his chance of succeeding. If there is a linear rela- 
tionship between each test and the assessment, the ideal profile would 
touch one end of each test continuum. 

When dealing with human abilities, linear relationships are usually
found between tests and assessments. However, this is not necessarily
the case. For example, it may be that very high “general intelligence”
indicates that a person will do poorly in very routine tasks such as
those of the truck driver or elevator operator. In the areas of
interests, attitudes, and personality characteristics it is more likely
that some of the relationships between predictors and assessments will
be nonlinear. A curvilinear relationship would be expected between
school grades and the amount of anxiety toward school work, as pictured
in Figure 7-8.

FIGURE 7-8. Hypothetical relationship between amount of anxiety toward
school work and grades. (Horizontal axis: anxiety, from low to high.)

In Figure 7-8 the hypothesis is that the persons who make the highest
grades have a moderate amount of anxiety. Both the persons who show very
high anxiety and those who show very low anxiety make low grades in
school. The low anxiety students would make low grades because they are
so uninterested in accomplishment that they do not make the necessary
effort. The high anxiety students are so worried about failure that they
have no attention left for school work. This is an example of the type
of nonlinear relationship that could not be used in multiple-regression
analysis but could be used in discriminatory analysis. In discriminatory
analysis the ideal profile point for the anxiety variable would be in the
middle of the test continuum.
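
A small numerical sketch may help show why such a relationship is lost to a linear method but not to a distance-from-ideal method. The data are hypothetical: nine anxiety scores with grades that peak at a moderate score of 5, which is assumed here to be the ideal point. The function name `pearson_r` is chosen for the example.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no spread")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: grades peak at a moderate anxiety score of 5.
anxiety = [1, 2, 3, 4, 5, 6, 7, 8, 9]
grades = [20 - (a - 5) ** 2 for a in anxiety]

ideal_point = 5  # assumed middle of the anxiety continuum
distance_from_ideal = [abs(a - ideal_point) for a in anxiety]

r_linear = pearson_r(anxiety, grades)
r_distance = pearson_r(distance_from_ideal, grades)
print(round(r_linear, 2), round(r_distance, 2))
```

In these made-up data the linear correlation between anxiety and grades is essentially zero, so a regression weight for anxiety would carry no information. The distance of each person from the ideal point, by contrast, is strongly and negatively related to grades.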

Linearity of relationships should be checked by an inspection of the 
scatter diagrams before a decision is made as to the type of analysis to 
use. The discriminatory type of analysis is more general and can use the 
information provided by some types of nonlinear relationships. However, 
regression analysis is easier for most persons to understand and compute. 
When the nonability areas of measurements are more effectively devel- 
oped, it can be expected that methods of discriminatory analysis will come 
more widely into use. 

A second feature which distinguishes the two types of analysis is the 
nature of the assessment score distribution. Multiple-regression analysis 
assumes that the assessment scores are continuously distributed, or at 
least approximately so. Discriminatory analysis assumes that the assess- 
ment is a qualitative distinction, a category rather than a continuum: all 
of the people who are successful in a particular job are considered to be 
of equal merit. 

There are advantages and disadvantages in considering the assessment 
as a continuously distributed variable rather than as a category. In many 
situations it makes more sense to assume that there are gradations of 
ability within the category of success. Such would be the case in school 
work, salesmanship, athletic ability, and job performance. However, it is 
not always possible to measure the fine gradations of ability, and some- 
times only the crude index of “pass” or “fail” can be used as the assess- 
ment. This is particularly so in the personality area, where it makes sense 
to validate an adjustment inventory against the dichotomized assessment 
of mental hospital patients versus nonhospital patients. To obtain finer 
gradations on the assessment would be very difficult. Some variables are
inherently categorical rather than continuous. Examples of these are men 
versus women, sailors versus soldiers, and plumbers versus barbers. 

Whenever the purpose of the analysis is to distinguish qualitatively 
different groups or when only a dichotomized assessment can be obtained, 
discriminatory rather than multiple-regression analysis should generally 
be used. If continuous assessments are available and relationships be- 
tween predictor tests and assessments are reasonably linear, multiple 
regression should generally be used. 

In order to give any rules at all as to when multiple-regression analysis is 
preferable to discriminatory analysis and vice versa, it has been necessary to 
oversimplify the problems involved. Primarily it is necessary to consider 
whether the test battery is used for selection, guidance, or classification. 
In selection, the purpose is to choose only those people who are likely to
succeed in one or more of the positions involved. If regressions between
tests and assessments are linear, multiple-regression analysis will provide
an estimate of each person’s performance in each assessment situation. In
this way people can be selected who will likely succeed in at least one
position. It is best to think of selection as an institution-oriented
procedure: the purpose is to choose the most promising persons and to reject
those persons who fail to give promise for at least one position.

Classification and guidance are more person oriented. The purposes 
are to place people in positions in which they will do better and to advise 
them to try positions in which they will do better. However, even if an 
individual is given the best placement possible, if he is relatively
untalented, he may be only moderately successful. Such an individual would
probably be rejected in a selection program. In classification and
guidance, individuals cannot be rejected: the best placement must be found
for each individual regardless of his over-all ability.² In guidance and
classification, the discriminatory approach has certain advantages over the 
multiple-regression approach, especially when regressions between tests 
and assessments are nonlinear. Discriminatory analysis will help put the 
individual into a position which is shared by other individuals of the 
same general interests, personality characteristics, and abilities. This is 
likely to make the individual feel comfortable in his job and comfortable 
with his work mates. 

Obviously, the institution-oriented and person-oriented philosophies 
and procedures are somewhat at odds with each other. In the long run, 
it can be expected that there will be a marriage of the two philosophies 
which will incorporate the concerns of both and provide more compre- 
hensive procedures for making decisions about human beings. This is 
presently one of the frontiers of measurement methodology. 


THE MULTIPLE-CUTOFF METHOD 


A third approach to the use of multiple measurements for the selection 
of personnel is to set cutoff scores on each of the tests. A person is selected 
for the particular job, school, or position, only if he scores at least as 
high as the cutting score on all tests. For example, a set of cutoff scores 
for selecting students for electronics training might be 80 on an intelli- 
gence test, 50 in perceptual speed, and 65 in motor dexterity. The actual 
cutoff points would be determined in such a way as to raise the
probability that the selected individuals will succeed. For example, the
cutting score for each test might be set so that 75 per cent of the persons
scoring at or above the cutting score succeed on the job.

² Rejection occurs in vocational guidance only in the extreme case where an
individual is subnormal to the extent that he is advised not to enter school
or work at all. In classification, rejection occurs only when an individual
is released after an initial training period or when he is assigned to a
low-level job not specifically intended for members of the group into which
the individual was originally selected.

The selection ratio determines how high the cutting scores are placed. 
If there is a large number of applicants and only a relatively small
number of these must be selected, the cutting scores can be placed high. On
the other hand, if most of the people who apply must be selected, the 
cutting scores necessarily must be low. 
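
Both ways of placing a cutting score, by a target success rate and by the selection ratio, can be sketched as follows. The records are hypothetical (test score, succeeded on the job), and the function names are invented for the illustration.

```python
def success_rate_at_or_above(records, cutoff):
    """Proportion succeeding among persons scoring at or above the cutoff."""
    chosen = [succeeded for score, succeeded in records if score >= cutoff]
    return sum(chosen) / len(chosen) if chosen else None


def lowest_cutoff_for_rate(records, target=0.75):
    """Lowest observed score whose at-or-above group meets the target rate."""
    for cutoff in sorted({score for score, _ in records}):
        rate = success_rate_at_or_above(records, cutoff)
        if rate is not None and rate >= target:
            return cutoff
    return None


def cutoff_for_selection_ratio(scores, ratio):
    """Score of the last person admitted when the top `ratio` share is taken."""
    ranked = sorted(scores, reverse=True)
    n_selected = max(1, round(ratio * len(ranked)))
    return ranked[n_selected - 1]


# Hypothetical tryout records: (test score, succeeded on the job).
records = [
    (40, False), (50, False), (55, True), (60, False),
    (65, True), (70, True), (80, True), (90, True),
]
scores = [score for score, _ in records]

cut_75 = lowest_cutoff_for_rate(records, target=0.75)
cut_few_openings = cutoff_for_selection_ratio(scores, 0.25)
cut_many_openings = cutoff_for_selection_ratio(scores, 0.75)
print(cut_75, cut_few_openings, cut_many_openings)
```

With these made-up records, a cutting score of 55 is the lowest at which at least 75 per cent of those at or above it succeed. When only a quarter of the applicants are needed the cutting score can rise to 80, and when three-quarters must be taken it falls back to 55.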

The multiple-cutoff method is particularly valuable when, as is some- 
times the case, tests are predictive only on their lower extremes. This 
tends to be the case for some of the motor, perceptual, and sensory abili- 
ties. For example, minimal levels of auditory and visual acuity are essen- 
tial for a number of jobs. However, beyond the minimum requirement, 
success on the job does not increase with higher and higher auditory and 
visual acuity. 

Multiple regression is a compensatory model in that a high score on one 
test can compensate for a low score on another test. This is a sensible 
model in some situations, such as in college training, where an individual 
can, to some extent, substitute one ability for another. However, in some
situations a high score on one measure cannot compensate for a low score 
on another measure. For example, in the operation of sound detection 
devices in submarines, the operator must have a certain level of auditory 
ability. Otherwise he will fail, regardless of how high he might score in 
intelligence and other abilities. 
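
The contrast between the compensatory and the multiple-cutoff models can be shown with two hypothetical applicants. The cutting scores are those of the electronics-training example given earlier; the regression weights and the applicants' scores are invented for the illustration.

```python
# Cutting scores from the electronics-training example in the text.
CUTOFFS = {"intelligence": 80, "perceptual_speed": 50, "motor_dexterity": 65}

# Hypothetical regression weights; a real battery would estimate them from data.
WEIGHTS = {"intelligence": 0.5, "perceptual_speed": 0.3, "motor_dexterity": 0.2}


def composite(scores, weights=WEIGHTS):
    """Compensatory model: a weighted sum, so strengths offset weaknesses."""
    return sum(weights[test] * scores[test] for test in weights)


def passes_all_cutoffs(scores, cutoffs=CUTOFFS):
    """Multiple-cutoff model: every test must reach its cutting score."""
    return all(scores[test] >= cut for test, cut in cutoffs.items())


# Hypothetical applicants.
applicant_a = {"intelligence": 120, "perceptual_speed": 45, "motor_dexterity": 70}
applicant_b = {"intelligence": 85, "perceptual_speed": 55, "motor_dexterity": 66}

for name, scores in (("a", applicant_a), ("b", applicant_b)):
    print(name, round(composite(scores), 1), passes_all_cutoffs(scores))
```

Applicant a has the higher composite because a very high intelligence score offsets low perceptual speed, yet he fails the multiple-cutoff rule. Applicant b has the lower composite but clears every cutting score.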

The multiple-cutoff method is preferable to the multiple-regression 
model and discriminatory analysis when the tests are predictive only on 
their lower extremes. This can be seen by looking at the scatter diagrams 
comparing test performance with job performance. If the test is predictive 
only on the lower extreme, the regression will tend to be curvilinear and 
the points will be more closely packed about the regression line on the 
lower-score section of the test. An example is shown in Figure 7-9.

The multiple-cutoff method can be used in conjunction with other 
approaches. It is usually unwise to select an individual for a position or, 
in vocational guidance, to advise an individual to take a position, if he 
is very low in any one of the related abilities. Consequently, it is useful to 
set relatively low cutoff points for each test, and not consider an
individual for a position if he is below the cutoff point on any one of them.
Then if, in the remaining group, test scores tend to correlate with job 
success in a linear manner, scores can be combined by multiple regres- 
sion. The final selection would be made on the basis of the combined 
scores. If people are being selected for several jobs, the procedure can 
be repeated for each job. Each job would have its minimum-cutoff scores 
for weeding out the obviously unfit, and each job would have its multiple- 
regression equation for the final selection. 
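
The combined procedure described above, a low cutoff screen followed by a regression composite, might be sketched as follows. The applicants, tests, cutting scores, and weights are all hypothetical, and `select` is a name chosen for the example.

```python
def select(applicants, cutoffs, weights, n_openings):
    """Screen on minimum cutoffs, then rank the survivors by weighted composite."""
    survivors = {
        name: scores
        for name, scores in applicants.items()
        if all(scores[test] >= cut for test, cut in cutoffs.items())
    }
    ranked = sorted(
        survivors,
        key=lambda name: sum(weights[t] * survivors[name][t] for t in weights),
        reverse=True,
    )
    return ranked[:n_openings]


# Hypothetical data for one job.
applicants = {
    "Adams": {"auditory": 35, "intelligence": 95},
    "Baker": {"auditory": 60, "intelligence": 70},
    "Clark": {"auditory": 45, "intelligence": 90},
    "Davis": {"auditory": 80, "intelligence": 50},
}
cutoffs = {"auditory": 40, "intelligence": 40}    # low minimum requirements
weights = {"auditory": 0.4, "intelligence": 0.6}  # assumed regression weights

chosen = select(applicants, cutoffs, weights, n_openings=2)
print(chosen)
```

Adams is removed at the first stage despite the highest intelligence score, because he falls below the auditory minimum. For several jobs, the function would simply be called once per job with that job's own cutoffs and weights.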

FIGURE 7-9. Relationship between test scores and performance in which a
cutting score would be useful in selecting personnel.


PRACTICAL APPLICATIONS OF TEST BATTERIES


There is a choice in many testing situations as to whether one general 
test should be used or whether a number of different tests should be 
administered in a battery. The potential advantage is always with mul- 
tiple measurements: more information cannot lower prediction. More- 
over, there are some circumstances in which it is especially important to 
measure the individual’s differential ability. One such circumstance is 
vocational guidance. The individual comes for advice; and, regardless 
of how high or low his over-all ability (profile level), the problem is to 
find the occupations that are best suited to his pattern of abilities (profile 
shape). Even the person who makes a very low score on a general in- 
telligence test might have certain aptitudes, such as motor dexterity and 
mechanical ability, that would suit him for some jobs. 

Any circumstance in which the effort is to classify persons after pre- 
liminary selection especially requires the use of multiple predictors. The 
officer training programs in the Armed Services exemplify such situations. 
First, the potential officers are tested as a prelude to acceptance in the 
officer training programs. One general measure can be used as a basis for 
preliminary selection. The general measure would then indicate much 
of what we have spoken of as level in the test profile. If the individual has 
a high level, it is certain that he is high in some of the separable abilities. 
After the course of training, the new officers can then be classified, or in
other words, placed on particular jobs. In the Army some might become
artillery officers, some members of the intelligence sections, and others 
might enter communications activities. It would then be important to 
know the individual’s differential ability in order to match his special 
talents with particular assignments. Whereas the general test is sufficient 
for the initial selection of men, it is necessary to give a battery of tests 
for effective classification of the men later. 

After discussing the advantages of multiple measurements, it is im- 
portant to consider why test batteries rather than single tests are not 
employed more widely in practical work. One reason is that test batteries 
are more expensive to construct and use than single tests. Another reason 
is that good test batteries require careful preparation and research, and 
it is often the case that hastily constructed batteries fail to work well in
practice. There is little doubt that well-designed test batteries will even- 
tually come into use in many testing programs. They will pay for 
themselves by improving the guidance, selection, and classification of 
individuals. The know-how is already available to construct better test 
batteries for many uses; and it will come to fruition when administrative 
officials in schools, the Armed Services, industry, and other institutions 
supply the funds and resources which are needed.


CHAPTER 8


Test Construction 


So far we have talked about how tests are studied after they are 
developed, but no attention has been given to test construction. Here 
we enter into one of the most complex parts of measurement theory, and 
this introductory text will be able to cover only some of the general prin- 
ciples. The rules for test construction change in terms of the way in 
which tests are used. For example, if the question is asked, “Should a 
‘good’ predictor test have items that are homogeneous to one another or
should they be heterogeneous to one another?” the answer is, “Yes!” We 
will see how this and other seemingly paradoxical rules work in special 
testing problems. 

The rules for constructing a particular instrument can be clarified at 
the outset by deciding whether the test is intended to be a predictor or 
an assessment. Although there are desirable features which should be 
present in both types of instruments, such as high reliability, there are 
other points on which the standards are very different. We will begin 
by discussing the construction of predictor tests, then assessments, and 
end the chapter with a discussion of some features which they should
have in common.


CONSTRUCTION OF PREDICTOR TESTS 


As was cautioned earlier, before work is undertaken on a predictor 
test, some attention should be given to what it is meant to predict, the 
criterion, or assessment. In particular, studies should be undertaken to 
ensure that the criterion is at least moderately reliable. If not, the test 
constructor can perform some “criterion research,” as it was alluded to 
in Chapter 4, to help establish a reliable assessment. 

Job Analysis. Although it was said that a predictor test is initially 
formed from a hypothesis—a hypothesis that a particular kind of item 
will predict an assessment—the early hunches can usually be improved 
by studying the assessment situation. That is, if we are composing tests
to predict success in the operation of particular machinery, we will want
to become acquainted with the job before constructing tests. Among 
other things, we will want to interview men who presently work at the 
job, talk with persons who manage the operation, and make notes about 
the exact duties which the job entails. Among the prominent things to 
look for are the following: 

1. Physical requirements. Does the job require considerable physical 
strength? Does a tall or a short person have an advantage in the work? 
(Aircraft companies employed midgets during the Second World War 
to work inside the wings of large airplanes.) Would an older person 
meet with difficulties? 

2. Sensory and motor skills. Does the job require particular kinds of 
muscular coordination? Is acuity of vision, or hearing, or touch vital to 
the job? Is rapidity of movement a necessary attribute? 

3. Intellectual abilities. What is the employee required to know? Is
it necessary for the employee to study and understand complex instruc- 
tions? What kinds of decisions does the job require? 

4. Personality requirements. Using the word “personality” in the broad 
sense, what type of person gets along well on the job? Is it necessary for 
the employee to supervise other persons? Does the job require courage 
and/or calmness? Does the employee work independently or in coopera- 
tion with other men? 

A job analysis such as this will give some hints as to the kinds of tests 
which are likely to serve as successful predictors. In some situations, it 
is not necessary to go through such formal procedures of job analysis 
before constructing predictor tests. For example, in making up tests to 
predict school success, it is often assumed that the intellectual abilities 
will be the most important determinants of success; and tests appropriate 
to the type of school program can often be constructed without a careful 
scrutiny of the classroom work. However, when a new assessment is to 
be predicted, and the test constructor is not well acquainted with the 
job, a job analysis offers the most logical first step in a testing program. 

Type of Test. After surveying the job requirements and making some 
decisions about the attributes which will be tested, it is necessary to 
decide what form the tests will take. It may be that some of the attri- 
butes will have to be tested on single individuals, whereas some of the 
other tests can be given to a group of men at one time. Some of the 
attributes may require performance tests, and mechanical “gadgets” will 
need to be constructed. Some of the tests may be susceptible to “objec- 
tive” scoring; judges may be required for the rating of other test results. 
Such decisions about the nature of tests are determined in a large 
measure by practical considerations, by the resources available to the 
testing programs and the amount of time which each subject can spend
in taking tests. That is why paper-and-pencil multiple choice tests are
used more frequently than any other kind. They are the easiest to con- 
struct, administer, and score. 

The Item Pool. After a decision has been made to construct a
particular type of test, the next step is to compose a large number of test
items, an item pool. The item is the lowest common denominator of a 
test, the individual “thing” which is scored. An item may be the time 
taken to assemble a mechanical puzzle, the blood pressure of a subject, 
or an arithmetic problem. In each case the item is the indivisible scoring 
unit. 

The kinds of items which are included in the item pool depend on the 
definition of the item universe, which can, at least theoretically, be 
thought to underlie particular tests. For example, if it is thought that 
mathematical reasoning problems will predict an assessment well, the 
item pool is filled with mathematical reasoning items. However, there 
are different kinds of mathematical reasoning items, and the test
constructor sometimes has difficulty in stating explicitly which kinds will be
included and which will be ignored. 

The more items which are composed for the item pool, the more room 
there will be to pick and choose among them in constructing a test. As 
a minimum, the item pool must, of course, have as many items as are 
planned for the eventual test. If resources permit, the item pool should 
have three or four times as many items as will be used in the test. It 
helps to have several persons compose items; otherwise the items might 
overemphasize the language and types of problems known to one person. 

The Review of Items. Before items are tried out empirically, the item 
pool should be inspected by several sophisticated judges for obvious 
flaws. The judges should look for the following things: 

1. Unclear statement of problems. It is possible to phrase an item so 
poorly that even a person who knows the subject matter thoroughly will 
mark the wrong alternative. Most of us are better able to find other 
people’s awkward expressions than to note our own mistakes. Therefore, 
an independent group of judges can usually point out a number of items 
which should either be thrown away or should be reworded. 

2. Wrong answers. Test constructors are capable of making errors, and 
what is stated to be the correct answer for a problem may not be correct 
at all. It is not uncommon to find tests some of whose items have no 
correct alternatives. The probability of this happening diminishes with 
the number of sophisticated individuals who inspect the items. 

3. Inappropriate difficulty. An inspection of the items will help weed 
out some of those which are either so easy that everyone will get them 
correct or so hard that practically no one will get them correct. An 
empirical tryout of the items will give more exact information about the
difficulty of items, but some initial inspection of the item pool can weed
out extreme deviants. 

4. Inappropriateness to item universe. Judges should look for items
which appear to be unrelated to the attribute which the test is intended 
to measure. For example, if a test of mathematical reasoning is being 
constructed, you would not want to have questions which use Latin 
words. It would then become partly a test of the individual's knowledge 
of Latin. You would probably not want to have problems dealing with 
specialized mathematical techniques, such as calculus. The individual’s
ability to answer the question would depend on whether or not he had 
studied the specialized techniques rather than on his ability to reason 
mathematically. In general, the item pool should be kept free of all 
abilities and types of information which are not directly related to the 
attribute to be measured. 

Empirical Tryout of Items. The next step in test construction is to 
administer the whole item pool to groups of subjects. If the item pool is 
large, say over three hundred items, it may be necessary to give part
of the items to one group and part to another group. If more than one 
group is used, the groups should be homogeneous in respect to ability. 
It is essential that relatively large numbers of individuals try each item.
If there are fewer than 100 persons, only a limited amount of information 
can be obtained from the tryout. In order for the results of the tryout to 
be acceptably stable, as many as 300 subjects should be obtained, and 
the use of 1,000 subjects is not unusual. The availability of subjects sets 
a limit to the methods of test construction which can be used. If only 
several dozen subjects are available, which is often the case in composing 
tests for particular jobs, an empirical tryout of items will give statistical 
results which are more misleading than helpful. However, an empirical 
tryout with a few subjects is helpful in improving the test instructions 
and administration procedure.

After the empirical tryout of the item pool, a number of statistical 
procedures can be used to help construct a test. The purpose of item 
analysis is to select from an item pool a minimum number of items 
which will give a maximum prediction of a criterion. 


MULTITEST ITEM ANALYSIS 


In establishing the item pool, it is assumed that the items all measure 
the same thing. However, in spite of the best efforts, it may turn out 
that there are several different attributes involved in the items. For 
example, in a group of mathematical reasoning items, it may be found 
that the questions which require considerable arithmetic computation go
together as an attribute and that the items which require little
computation and considerable reasoning go together as another attribute.

Factor Analysis. There are statistical procedures which can be used to 
determine whether or not an item pool contains only one attribute, and 
if not, how many attributes are involved. The most thorough procedure
to use is factor analysis (to be discussed in the next chapter). This and 
other related procedures deal with the correlations among items. That is, 
just as it is possible to correlate tests with one another, the separate 
items within a test can be correlated with one another. If only a “pass” 
or “fail” score is given to each item, as is usually the case in multiple 
choice tests, special correlational formulas are required (see 8, chap. 10; 
10, chap. 6). 

An attribute, or factor, is defined by a group of items whose members
correlate highly among themselves and correlate very little with the items 
in other groups. If two items correlate highly, it means that they are 
measuring much the same thing. In very few testing programs is it 
feasible to undertake the labors of intercorrelating all of the items in an
item pool. If there are 300 items, 44,850 separate correlation coefficients
would need to be computed. The subsequent work of analysis would be 
even more laborious, making for an amount of calculation which would 
strain even our modern electronic computers. 
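
The count of 44,850 follows from the number of distinct pairs among 300 items, n(n - 1)/2. The sketch below checks that figure and also shows one correlational formula for pass-fail items, the phi coefficient. Naming phi here is an assumption; the text says only that special formulas are required. The item responses are hypothetical.

```python
import math


def n_item_correlations(n_items):
    """Number of distinct item pairs, n(n - 1)/2."""
    return math.comb(n_items, 2)


def phi_coefficient(item_x, item_y):
    """Correlation between two pass-fail items scored 1 (pass) or 0 (fail)."""
    pairs = list(zip(item_x, item_y))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # pass both
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # pass x only
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # pass y only
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # fail both
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when an item has no variation")
    return (a * d - b * c) / denominator


print(n_item_correlations(300))  # 44850, the figure given in the text
print(n_item_correlations(50))   # 1225 for a 50-item subsample

# Hypothetical responses of eight subjects to two items.
item_1 = [1, 1, 1, 1, 0, 0, 0, 0]
item_2 = [1, 1, 1, 0, 1, 0, 0, 0]
print(phi_coefficient(item_1, item_2))
```

The drop from 44,850 coefficients to 1,225 for a subsample of 50 items shows why analyzing only part of the pool is so much more practical.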

Approximate Procedures. One approximate procedure is to factor
analyze only a part of the total item pool. It is within practicality to factor
analyze, say, a group of 50 instead of 300 items. The items which are
factor analyzed can either be selected randomly from the total item
pool; or, if it is suspected that certain groupings of the items will occur, 
a number of items can be chosen from each of the suspected groups. A 
factor analysis will then show how many factors are involved in the 
subsample of items from the pool. Then, by correlating all the items 
which were not in the factor analysis with each of the factors, the 
groupings for the total item pool can be estimated. There are numerous 
other such approximate procedures which are used to decide how many 
tests should be constructed from an item pool (see 14). 

Difficulty Levels. If the multitest approach is being used and statistical 
analyses have been undertaken to specify the item groupings, one more 
type of information is required before composing tests to measure the 
factors. It is necessary to determine the difficulty level for each item, 
which is the per cent of persons who get an item “correct.” You will 
remember that the purpose of item analysis is to maximally predict a 
criterion. In Chapter 5 the effect of the standard deviation on the 
correlation coefficient was described: the larger the dispersion, the higher 
the correlation in general. The difficulty levels of the items in a test 
determine, in part, the standard deviation of scores on the test. Some 
illustrations will help to show how this works. If the items are so easy 
that everyone gets all of the items “correct” (the difficulty level for each 
item is 100 per cent), everyone will obtain the same score, and the 
standard deviation will be zero. The standard deviation will also be zero 
if the items are so hard that no one gets any of the items correct. 
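
These illustrations can be checked on a small scale. The sketch below assumes a table of responses with one row per person and one column per item, scored 1 for "correct" and 0 otherwise; the three small tables are hypothetical.

```python
from statistics import pstdev


def item_difficulties(responses):
    """Per cent of persons answering each item correctly."""
    n_persons = len(responses)
    n_items = len(responses[0])
    return [100.0 * sum(row[j] for row in responses) / n_persons
            for j in range(n_items)]


def total_scores(responses):
    """Number of correct responses for each person."""
    return [sum(row) for row in responses]


too_easy = [[1, 1, 1, 1] for _ in range(5)]   # everyone passes every item
too_hard = [[0, 0, 0, 0] for _ in range(5)]   # no one passes any item
mixed = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]

for label, table in (("easy", too_easy), ("hard", too_hard), ("mixed", mixed)):
    print(label, item_difficulties(table), pstdev(total_scores(table)))
```

Only the table with intermediate difficulty levels spreads the five persons over different scores.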

If a multiple choice test is being used, guessing will probably influence 
the item difficulty levels. A certain per cent of the subjects will get an 
item correct by guessing even if no one really knows the answer. If there 
are n alternatives for an item, 1/n of the subjects would probably get the 
item correct by guessing alone. Consequently, it is not expected to find 
items with difficulty levels below 1/n. The minimum expected difficulty 
for five alternatives is 20 per cent, for four alternatives 25 per cent, and 
so on.¹ 

The number of alternatives used for each item usually restricts the 
range of possible scores. With five alternatives, the range is from 20 per 
cent correct answers to 100 per cent correct answers. Then in a 40-item 
test the scores could range from about 8 to 40. The problem is to choose 
items in terms of difficulty levels in such a way as to spread people out 
widely in the possible score range. The rule is to pick a set of items whose 
average difficulty level is near the middle of the possible score range. 
That is, in developing a test with five alternatives for each item, items 
should be chosen in such a manner that their average difficulty is about 
60 per cent. The average difficulty is found by summing the separate 
difficulty levels and dividing by the number of items. It is not important 
that the mean difficulty be exactly at the middle of the possible score 
range, but the standard deviation of test scores will be lowered if the 
mean difficulty is near either the bottom or top of the possible score 
range. 
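
The arithmetic behind these figures is sketched below. The function names are illustrative, and the calculation assumes that blind guessing among n equally attractive alternatives is the only source of chance success.

```python
def chance_floor(n_alternatives):
    """Lowest expected difficulty level (per cent correct) under pure guessing."""
    return 100.0 / n_alternatives


def target_mean_difficulty(n_alternatives):
    """Middle of the possible range, from the chance floor to 100 per cent."""
    return (chance_floor(n_alternatives) + 100.0) / 2


def expected_score_range(n_items, n_alternatives):
    """Approximate lowest and highest scores on a test of n_items."""
    return (n_items / n_alternatives, n_items)


print(chance_floor(5), target_mean_difficulty(5))   # five alternatives
print(chance_floor(4), target_mean_difficulty(4))   # four alternatives
print(expected_score_range(40, 5))                  # the 40-item example
```

With four alternatives the same reasoning places the target mean difficulty slightly higher, at 62.5 per cent.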

In addition to the decision to choose items that have a mean difficulty 
near the middle of the possible score range, there is a question as to the 
distribution of difficulties that should be sought. Should all of the items 
have difficulties close to the middle of the score range, should half of 
the items be very easy ones and the others very hard items, or should 
a particular distribution of difficulties be sought? The ideal distribution 
of difficulties varies in terms of the use which will be made of the test 
and the intercorrelations of the items. Consequently, no “airtight” rule 
can be given for all situations. 

¹ Although this rule is generally useful, there are special circumstances in which it 
is inappropriate. This is particularly so when the item alternatives are carefully 
designed to contain “misleads,” i 

After an empirical tryout of an item pool, a good general procedure 
is to choose approximately an equal number of items at each difficulty 
level in the possible score range.² That is, if a 40-item test is being 
constructed with 5 alternatives for each item, the procedure would be to 
choose 5 items with difficulties between 20 and 30 per cent, 5 items with 
difficulties between 30 and 40 per cent, and so on to 5 items with diffi- 
culties between 90 and 100 per cent. If in the item pool there are more 
than the required number of items at particular difficulty levels, there 
is a question as to which of the items should be included in the test. 
If the items have been previously studied by factor analysis, those items 
should be used that correlate most highly with the factor which the test 
is intended to measure. 
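
A minimal sketch of this selection rule follows. It assumes that each item in the pool is recorded as a hypothetical triple of identifier, difficulty level in per cent, and correlation with the factor; the band limits follow the five-alternative example, and the small pool is invented for illustration.

```python
def select_items(pool, floor_pct=20, band_width=10, per_band=5):
    """Pick up to per_band items in each difficulty band, best correlations first."""
    n_bands = (100 - floor_pct) // band_width
    bands = {}
    for item_id, difficulty, factor_r in pool:
        if difficulty < floor_pct or difficulty > 100:
            continue  # outside the possible score range
        # An item at exactly 100 per cent is placed in the top band.
        index = min(int((difficulty - floor_pct) // band_width), n_bands - 1)
        bands.setdefault(index, []).append((item_id, factor_r))
    chosen = []
    for index in sorted(bands):
        ranked = sorted(bands[index], key=lambda entry: entry[1], reverse=True)
        chosen.extend(item_id for item_id, _ in ranked[:per_band])
    return chosen


# Hypothetical pool: (identifier, per cent correct, correlation with factor).
pool = [("a", 22, 0.30), ("b", 27, 0.55), ("c", 35, 0.40),
        ("d", 38, 0.10), ("e", 95, 0.62), ("f", 100, 0.20)]
print(select_items(pool, per_band=1))
```

With the default of five items in each of the eight bands, a sufficiently large pool yields the 40-item test of the example.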

The Results of a Multitest Item Analysis. A test should be constructed 
for each of the major item groupings (factors), or at least for the ones 
that appear to be useful predictors. Scores on the separate tests can be 
combined in the ways which were discussed in Chapter 7 to give the best 
prediction of a particular criterion. 


UNITEST ITEM ANALYSIS 


It may be decided in a testing program that only one test can be used 
or, in other words, that it will be permissible to derive only one test from 
an item pool. This decision may be determined in part by principles of 
economy, because it is usually more expensive to derive and use more 
than one test. However, there is some confusion as to what constitutes 
one test. If the scores on 60 items are added up to obtain one score, it 
is said that one test is in use; but if the items are divided into three 
groups of twenty each, and each of the three groups of items is scored 
separately, it is said that three tests are in use. It is customary to say that 
there are as many tests being used as there are separate scores for each 
person, and the number of scores given to each person is not necessarily 
related to the length or appearance of the test material. 

It is unwise to begin the study of an item pool with the assumption that 
only one test will be developed, because, as was said in the previous 
section, statistical analyses may indicate that a number of factors under- 
lie the item pool. Although it is usually advantageous to use the multitest 
approach on the item pool, the most legitimate reason for using the 
unitest approach instead is that the laborious procedures of statistical 
analysis needed to determine the factors in an item pool are not feasible 
in many testing situations. 

² It must be kept in mind that the rule is most applicable to general-purpose tests, in 
which reliable differentiations at all points on the score continuum are equally important. 
In certain special cases a test is used only to distinguish certain “upper” and 
“lower” groups, for example to pick out the top 25 per cent. In that case the item 
difficulties would be massed near the cutting point rather than near the middle of the 
possible score range. That is, most of the items would be chosen in the 60 to 90 per 
cent difficulty range. Other procedures of item selection are required for special 
purposes. 

Multiple Regression. If the whole item pool has been given an empirical 
tryout, there is an ideal way to combine the items into a test: namely, 
to combine all of the items by multiple-regression formulas to predict 
the criterion. The analysis would first require all of the intercorrelations 
among the items, plus the correlation of each separate item with the 
criterion. Then it would be necessary to run a complete multiple-regression 
analysis, obtaining a beta weight for each item. If this ideal method 
of analysis were undertaken, it would give the best “least squares” prediction 
of the criterion and would even do a better job than the multitest 
approach. Unfortunately, such an analysis is totally impractical. Multiple-regression 
weights are difficult enough to compute with five or six variables; 
it is out of the question to perform such an analysis for, say, 
300 items in an item pool. Since this ideal type of analysis cannot be 
performed, some approximate approaches are necessary. 

Item Difficulties. What was said about item difficulties in respect to 
the multitest approach holds as well for the unitest approach. The more 
variance in the total scores the more the test is likely to correlate with 
the criterion. Therefore, a set of items should be chosen whose mean 
difficulty is at about the middle of the possible score range. The items 
should also be scattered along the range of difficulty in general resemblance 
to the scheme which was previously described. 

Item-criterion Correlations. If it is possible to perform the labors of 
correlating each item in the item pool with the criterion, this information 
can be combined with a knowledge of difficulty levels to make a more 
predictive test. The test score is usually obtained as a sum of the item 
scores, which is the number of correct responses. If the individual items 
do not correlate with the criterion, the total score will not correlate with 
the criterion either. 

A test will be more predictive if the items which correlate well with 
the criterion are chosen. The rule is that from the items at each difficulty 
level select those which correlate highest with the criterion. That is, if 
the item pool has 20 items with “difficulties” between 30 and 40 per cent, 
and the test will only be long enough to use five items in that category, 
the five items should be chosen which have the highest correlations with 
the criterion. 

Interitem Correlations. If it is possible to obtain not only the difficulty 
levels and the item-criterion correlations, but the intercorrelations of all 
of the items in the item pool as well, this information can be used to 
derive a more predictive test. It was said that multiple regression is the 
ideal method of unitest analysis. You will remember from Chapter 7 that 
the multiple correlation is highest when the predictor variables (items 
in this case) correlate high with the criterion and low with one another. 
Therefore, items should be chosen which not only correlate as much as 
possible with the criterion and conform to the requirements for difficulty 
level but which correlate as little as possible with one another. There 
are a number of item-analysis procedures which take into consideration 
both item-criterion correlations and interitem correlations (see 3, 6, 7, 10, 
12). 
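
The effect of interitem correlations on a simple summed score can be sketched with the standard formula for the correlation of a sum with an outside variable. The sketch makes simplifying assumptions not stated in the text: all k items have equal variances, the same correlation with the criterion, and the same correlation with one another. The numerical values are hypothetical.

```python
from math import sqrt


def composite_validity(k, r_item_criterion, r_interitem):
    """Correlation of an unweighted sum of k items with the criterion.

    Assumes equal item variances, one common item-criterion correlation,
    and one common interitem correlation.
    """
    return k * r_item_criterion / sqrt(k + k * (k - 1) * r_interitem)


for r_interitem in (0.1, 0.3, 0.5):
    validity = composite_validity(10, 0.2, r_interitem)
    print("interitem r =", r_interitem, "test validity =", round(validity, 3))
```

Holding the item-criterion correlations fixed, the summed score predicts the criterion better as the items correlate less with one another.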

In the multitest method, items are grouped, or placed in factors, because 
they correlate highly with one another. In the unitest method, items 
should correlate low with one another. In other words, the items within 
any one of the tests derived by the multitest approach should be homogeneous 
in respect to one another, and the items within a test obtained 
by the unitest approach should be heterogeneous in respect to one another. 
Consequently, the apparent paradox introduced earlier in the 
chapter is not a paradox after all: different rules apply in accordance 
with the item-analysis strategy being used. 


DECISIONS ABOUT THE CONSTRUCTION OF PREDICTOR TESTS 


Having seen some of the formal procedures of analysis which can be 
used, we should next ask, “What do people usually do in constructing 
tests?” To begin with, tests are often constructed without any kind of 
item analysis. People are often able to “dream up” a set of test items, 
which, when tried out, conforms fairly well to the statistical requirements. 
The author often requires his students to compose tests for the prediction 
of school grades. The students have neither the time nor the resources 
to compose large item pools, undertake empirical tryouts, or perform 
elaborate statistical analyses. It is surprising to find how many of these 
tests actually predict the criterion with moderate success and also have 
distributions of item difficulties which are acceptable. 

It is not recommended that formal procedures of test construction be 
ignored, but it is possible, and often happens, that an intuitively com- 
posed test will give satisfactory prediction. However, the item-analysis 
procedures are not dependent on the success of the individual’s intuitive 
processes, and the item-analysis methods will almost always lead to a 
higher test-criterion correlation. 

If resources are available for performing item analyses, then a decision 
must be made as to what procedures should be used. A failure to dis- 
tinguish between the multitest and unitest approaches has caused some 
confusion about proper methods of analysis. If the additional labors 
required by the multitest method can be undertaken, that will usually 
give the most satisfactory results. 

In addition to the likelihood of obtaining a higher prediction of the 
criterion from the multitest approach, the results have a generality 
beyond the immediate testing problem. The unitest method tends to 
produce a conglomeration of items which is directly related to only one 
assessment, or criterion. The multitest method divides items into homo- 
geneous groups, or factors, which can be validated against numerous 
criteria. Factor analysis is often used when there is no particular criterion 
to be predicted. Then the effort is to derive general attributes 
which can, in the course of events, be used to predict many different 
criteria. 


Taking Advantage of Chance. There is an important caution to be 
noted in the interpretation of the results of any item analysis which 
selects items in terms of the item-criterion correlations (which is done 
with some of the unitest methods but not with the multitest methods). 
If there are many items in the item pool, some of them will correlate 
highly with the criterion if only by chance. The more individuals used 
in the empirical tryout, the smaller will be the sampling error of correlation 
coefficients. As an extreme example, if the number of individuals is 
no larger than the number of items, it is always possible to find a perfect 
multiple correlation between the items and any other variable. You can 
name any other variable that you like, the weights of the subjects, for 
example, and the perfect multiple correlation can be found. However, 
the apparent validity would be only a spurious capitalization on sampling 
errors. 
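
The extreme example can be demonstrated with made-up numbers. In the sketch below, eight hypothetical persons have random scores on eight items and random body weights which are unrelated to the items. Weights for the items are found by solving the equations exactly, which is possible whenever the item scores are not linearly dependent; all names and figures are illustrative assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def solve_square(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    a = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(a[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (a[i][n] - known) / a[i][i]
    return solution


rng = random.Random(0)
n = 8  # as many persons as items
item_scores = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
body_weights = [rng.uniform(100, 250) for _ in range(n)]  # unrelated to the items

item_weights = solve_square(item_scores, body_weights)
predicted = [sum(w * s for w, s in zip(item_weights, row)) for row in item_scores]
multiple_r = pearson(predicted, body_weights)
print(round(multiple_r, 6))
```

The weighted items reproduce the body weights of these eight persons exactly, although nothing but chance connects the two sets of numbers.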

When we select only the few items which correlate highly with the 
criterion, this “takes advantage of chance.” A second empirical tryout 
will show that the selected items correlate lower with the criterion than 
was indicated in the first tryout. Consequently, a test derived from the use 
of item-criterion correlations will work less well in subsequent use than 
it appears to work on the original tryout group. After obtaining a test 
from item-criterion correlations, the test scores can be correlated with 
the criterion scores. We refer to this type of correlation as re-correlation, 
because the measure of predictive validity is obtained on the same sample 
of scores used to develop the test. Re-correlations are almost always 
spuriously large. There have been numerous instances in which a derived 
test has been re-correlated with the criterion to produce a seemingly 
handsome validity, only to find in subsequent research that the test has 
no predictive validity at all. 

Cross validation. Re-correlations are so misleading that it is better not 
to compute them. After the item analysis of the first tryout group, the 
test obtained should be administered to a second group of individuals. 
The correlation between the scores made by the second group and the 
criterion is the predictive validity of the test. The use of a second group 
in this way is referred to as cross validation. 
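
The shrinkage from re-correlation to cross validation can be simulated. In the sketch below every item is pure chance: the pass-fail responses and the criterion scores are random and unrelated, so the true validity of any test built from them is zero. The group size, pool size, and test length are hypothetical choices.

```python
import random
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def make_group(rng, n_persons, n_items):
    """Random pass-fail responses and an unrelated random criterion."""
    responses = [[rng.randint(0, 1) for _ in range(n_items)]
                 for _ in range(n_persons)]
    criterion = [rng.gauss(0, 1) for _ in range(n_persons)]
    return responses, criterion


def sum_scores(responses, chosen):
    """Score each person on the chosen items only."""
    return [sum(row[j] for j in chosen) for row in responses]


rng = random.Random(1)
n_persons, n_items, test_length = 50, 200, 10

# First tryout: keep the items with the highest item-criterion correlations.
first_items, first_criterion = make_group(rng, n_persons, n_items)
validities = [pearson([row[j] for row in first_items], first_criterion)
              for j in range(n_items)]
chosen = sorted(range(n_items), key=lambda j: validities[j], reverse=True)
chosen = chosen[:test_length]
re_correlation = pearson(sum_scores(first_items, chosen), first_criterion)

# Second group: the same items are scored on persons not used to select them.
second_items, second_criterion = make_group(rng, n_persons, n_items)
cross_validity = pearson(sum_scores(second_items, chosen), second_criterion)

print("re-correlation:  ", round(re_correlation, 2))
print("cross validation:", round(cross_validity, 2))
```

The re-correlation is large only because the items were picked for their chance agreement with the criterion in the first group; the second group gives the honest estimate.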


CONSTRUCTION OF ASSESSMENTS 


The major concern in this section will be with the construction of classroom 
examinations. Many of the varied assessments which are made of 
human performance are not, properly speaking, tests at all. The promotion 
of a naval officer, the hiring of a chef, the elevation of a business 
executive to a higher position may entail none of the formal, standardized 
procedures which we call tests. However, many of the requirements of 
a good assessment test apply to assessments in general. 

The Importance of Tests. Whenever students receive grades for their 
work, tests have a very large influence on human lives. The grade which 
a student obtains may determine whether or not he graduates from high 
school, is given a better position in civil service, or receives a commission 
in the Armed Forces. The teacher who takes his work seriously finds the 
fair allotment of grades to be the most arduous, and sometimes painful, 
part of his job. Every teacher has seen the time when a student's whole 
life had to be rearranged because of a failing grade. 

Tests not only share in the responsibility of allotting grades, they also 
determine in large measure what is learned. Students are very quick to 
detect what the teacher emphasizes in examinations, and it is only human 
for the student to work hardest on the things which are likely to appear 
on the test. Tests can also affect what the teacher teaches. If standard 
achievement tests are used, as they are in many school systems, the 
teacher is often under pressure to have the students do as well as 
possible. He is then likely to slant his instruction toward the things 
emphasized on the achievement test, even to the point of nearly re- 
hearsing the actual examination. 

Because of the importance of classroom examinations, they merit very 
careful preparation. It is unfortunate that the work load imposed on 
many teachers prevents their careful attention to the construction and 
grading of each test. A classroom examination is “good” if it is both 
reliable and valid. Anything that can be done to raise the reliability of 
the examination along the lines discussed in Chapter 6 will make for a 
better test. Validity of a classroom examination means content repre- 
sentativeness, in other words, that a wide sample of important questions 
has been employed. 

Educational Goals. What constitutes important materials can be deter- 
mined only by the teacher's careful attention to the aims of the course. 
Part of this decision should be determined by the future courses which 
the student will take. The first course in calculus must prepare the 
students for the second course in calculus. Although there is some latitude 
as to how it is taught, there are certain fundamentals to be covered. 




Outside the common core of knowledge that is expected to be imparted 
in a particular course, the instructor adds his own particular flavor to 
the subject he teaches. The student “takes the instructor” as well as 
taking the course. Each teacher has his own pet theories, anecdotes, and 
his strong and weak points in imparting the course material. To an 
extent this is very good, because much of education is concerned with 
learning how other people think. Whatever special directions are given to 
the subject matter, the instructor should be aware of what he is trying 
to teach and make course examinations which will reflect his goals. 


Outline of Content. Most instructors find it helpful to outline the goals 
of a course, to write down the important principles that they want to 
convey. Such an outline is not quite the same as the usual course outline, 
the latter being more concerned with the order of presenting topics than 
with the goals of instruction. In this textbook, for example, the goal is 
to convey several dozen general principles, some of which concern: 

1. The logical foundations of psychological measures 

2. The distinction between “processes” and individual differences 

3. The characteristics of assessments and predictors 

4. The nature of statistical prediction 

5. The influence of measurement error on predictions 

The author bases his own course examinations on a longer list of such 
issues. These are the things which he thinks will help the student in 
taking future courses, provide the widest background for using tests, 
and generally make him more erudite in the subject matter. 

The outline should be spelled out in greater detail than the several 
points listed above, including subheadings under each topic concerning 
the more particular goals in each section of instruction. Although most 
instructors agree, at least in principle, that the general points of view 
are the more important things to convey, the outline will include a knowl- 
edge of particular techniques, facts, and even some names and dates. The 
instructor must decide which details are important for students to learn 
and, consequently, which of these will find their way into examina- 
tions. 

A carefully done outline of course goals will not only prove useful 
in constructing examinations, it will help the instructor in teaching his 
course. He may note that he has been underemphasizing important points 
in the outline, or that relatively unimportant topics have occupied too 
much of the course time. After the outline is constructed, and some 
changes in it would be expected with the instructor's continued experi- 
ence with the subject matter, the problem is to translate the outline into a 
successful examination. At this point we depart from the skills of classroom 
instruction and move into the specialty of psychological measurement. 
The instructor who is wise in the ways of teaching and who may 
outline excellently the class goals may fail miserably in translating the 
goals into a creditable examination. Similarly, the measurement specialist 
is often not qualified to make decisions about the importance of various 
aspects of particular subject matters. A happy blending of skills occurs 
when the teacher has a firm understanding of the basic principles of 
testing. 


ESSAY VERSUS MULTIPLE CHOICE EXAMINATIONS 


As with predictor tests, there is, at least conceivably, a wide choice of 
forms for a particular assessment, including performance tests, work 
samples, ratings, and many others. The attention here will be given to 
paper-and-pencil forms, by far the most popular type of classroom 
examination. 

In the essay examination, the student is presented with a few ques- 
tions or problems and asked to supply the answers. In the multiple choice 
form the student is presented with a large number of questions, with a 
limited number of alternative answers for each. Considering the wide use 
of both of these forms and the amount of controversy which has been 
generated over which is “better,” their differences and special advantages 
will be discussed in some detail. 

Convenience. The essay examination is comparatively easy to construct 
but difficult to grade with more than a dozen students. An adequate 
multiple choice examination is much more difficult to construct, 
but it can be graded easily even with hundreds of students. There are 
no forms to be reproduced with the essay examination and no special 
materials are needed. The questions can be written on the blackboard 
for all students to see. It is usually necessary to have the multiple choice 
examination typed and mechanically reproduced, both of which may be 
a strain on school facilities. In addition, the multiple choice test may 
require special answer sheets and scoring forms. In general, the more 
students to be tested either at one time or in the eventual use of the test, 
the more practical convenience is obtained from the multiple choice form. 

Reliability. The multiple choice test is usually more reliable, and sub- 
stantially more so, than the essay test. The unreliability of the essay 
examination comes from two main sources: from test scoring and from 
the sampling of content. Teachers often disagree about the grade for 
particular papers, and individual teachers often give considerably dif- 
ferent marks to the same papers which they regrade after a period of 
time. Such inconsistencies are dependent on the teacher, with some in- 
structors being much less reliable in their grading. There is more error 
due to the sampling of content in essay examinations because of the 
small number of questions which are used. A student’s grade can depend 
considerably on his “luck” in having understood the particular materials 
on the test. The many items which can be used on a multiple choice test 
provide an opportunity to spread questions widely over the subject 
matter. 

Representing the Important Content. It is in this category that the 
strongest arguments have been given for the use of the essay examination. 
It has been pointed out by many persons that the multiple choice examination 
often emphasizes the simple facts of a subject matter while providing 
no evidence about the student’s knowledge of more general issues. 
The multiple choice test has been criticized as being only a measure of 
memory rather than a measure of understanding. Another argument is 
that the multiple choice examination reflects none of the student’s 
creative ability. 

Proponents of multiple choice examinations point out that regardless 
of how valid essay examinations might be in theory, the unreliability of 
essay examinations keeps them at a relatively low level of validity in 
practice. The criticisms of multiple choice examinations usually concern 
how bad they can be when improperly constructed and ignore the 
extent to which important materials can be framed in multiple choice 
form by the ingenious test constructor. There is a considerable amount 
of research evidence which indicates that well-constructed multiple 
choice examinations can effectively measure many different kinds of 
course content, even with materials that would not seem appropriate. For 
example, Tyler (13) found a correlation of .85 between the ability of 
students to draw generalizations from data and a multiple choice test, 
the alternatives being better and worse generalizations which could be 
attributed to the information. Even in the testing of French pronunci- 
ation, high correlations were found between samples of speech and a 
multiple choice test (9). Other studies have shown that multiple choice 
tests can effectively measure some of the skills in writing and mathe- 
matics (2, p. 73). 

Rather than try to decide whether multiple choice examinations are 
generally better than tests of the essay type, or vice versa, it would be 
more appropriate to see how they can both be made as effective as 
possible and how they can be used in conjunction with one another. 
When the course material is mainly factual and technical, or the con- 
cepts to be taught are relatively simple, there is a definite advantage for 
the multiple choice test. Multiple choice tests are generally more applicable 
at lower than at higher levels of education. Multiple choice tests 
can be justified in elementary school, and in many of the courses in high 
school and college. Some of the courses in high school and college which 
require creative talents and the ability to organize ideas have a more 
definite need for essay type examinations. In most graduate studies, it is 
very difficult to place the whole burden of determining grades on 
multiple choice tests alone. 


COMPOSING ESSAY EXAMINATIONS 


The Wording of Questions. Essay questions should range as broadly 
as possible across the course content. Content representativeness can be 
increased by having a larger number of questions for which the student 
is expected to write only a paragraph or two. Very general questions 
should be avoided, such as “How did Hitler come to power?” or “What 
happened to art in the fifteenth century?” It directs the student more 
accurately to a particular topic if the essay question spells out the points 
which are to be discussed, an example being as follows: 

Discuss the use of the standard error of estimate with tests under the 
following headings: 

1. The statistical derivation 

2. The meaning of the “per cent of variance explained” 

3. The effect on the standard error of estimate of changes in the dispersion 
of assessment scores 

4. The assumption of homoscedasticity 

Choice of Questions. A widely used procedure with essay examinations 
is to let the student choose a few from a larger number of questions. 
Although this is a popular procedure with students, it has definite disadvantages. 
Because the questions on a test are intended to be a sample 
of the course content, allowing the student to choose among the questions 
reduces the value of the test as a sample. Also, this procedure makes it 
difficult to compare students with one another. It is possible that two 
students would write on entirely different questions, providing no basis 
of comparison between them. 

Grading. There are several steps which can be taken to remove the 
largest flaw in essay examinations, the unreliability of grading. The 
primary way to increase the reliability is to use more than one grader. 
The student should then be given the average grade. The use of two 
graders will substantially raise the reliability, and if it is feasible, as 
many as five instructors can collaborate to make the grades even more 
reliable. Multiple gradings are usually employed with important graduate 
school examinations. 
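
The gain from averaging several graders can be estimated with the Spearman-Brown formula, a standard result which this passage does not name; applying it here assumes that the graders are interchangeable and equally reliable. The single-grader reliability of .60 is a hypothetical figure.

```python
def spearman_brown(single_reliability, n_graders):
    """Estimated reliability of the average grade from n comparable graders."""
    r = single_reliability
    return n_graders * r / (1 + (n_graders - 1) * r)


for n_graders in (1, 2, 3, 5):
    print(n_graders, "graders:", round(spearman_brown(0.60, n_graders), 2))
```

Under these assumptions most of the improvement comes from the second grader, and each additional grader adds less than the one before.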

If several persons grade the same examinations, each grader should 
have no knowledge of the grades given by the other graders. In order 
to discount the personal likes and dislikes of the graders, the students’ 
names should not be placed on the tests. Before giving grades, it helps 
the instructor to first read through all, or a number, of the tests. A better 
basis of comparison can be had by grading one question through all of 
the examinations and then returning to the second question. It is better 
to list the grades for each question on a separate sheet. If grades are 
placed on the tests, the first grades which are given might prejudice 
the instructor toward later questions. 

It is very difficult for the instructor to avoid certain biases in the grad- 
ing of essay examinations. The student who writes well in general has 
an advantage on most essay questions. If there are too many questions 
set for the allotted time period, an advantage is given to the person who 
can express himself rapidly in writing. The student with poor penman- 
ship and spelling is often penalized on examinations that have little to 
do with basic writing skills. Unless the course material specifically con- 
cerns writing skills, the instructor should grade in terms of the content, 
not in terms of the elegance of presentation. 


THE CONSTRUCTION OF MULTIPLE CHOICE EXAMINATIONS 


The Item Pool. A collection of items is needed for the classroom test 
as well as for the construction of a predictor test. It is also helpful to 
have many more items than will be used in the test. The principles used 
in the selection of the “best” items for the classroom test are different 
from those used to compose a predictor test. In composing a predictor 
test, items are selected in such a way as to maximize the prediction of 
a criterion. In constructing an assessment test, items are selected in such 
a way as to maximally represent the performance or, in the case of a 
classroom examination, the course content. 

The outline of content serves as a guide to the collection of items. One 
way to proceed is to make up a number of items for each of the subheadings 
in the outline. Instructors usually find it easier to collect items 
over a period of time to fit the various parts of the outline, rather than 
compose them all at once. The author establishes an item pool for a 
course in tests and measurements by writing down items as they occur 
to him, or after an important issue has been raised in classroom dis- 
cussion. Over a period of time some hundreds of items have been ob- 
tained in this way, with different item pools available for each of the 
several tests used during the semester. 

Choice of Items. Most instructors agree that the general points of view 
and the ability to use course content are more important than segregated 
facts and memory for specific details. Although some simple factual items 
should appear in most tests, many of the items should concern the 
student’s ability to understand issues and reason from one situation to 
another. Items of that kind are difficult to construct, but carefully com- 
posed multiple choice items can measure the student’s understanding of 
complex issues. 

The Wording of Questions. Careful attention should be given to the 
wording of both predictor and assessment test items. There is generally 
more difficulty in the wording of classroom examinations because the 
questions are usually more complex. The inexperienced instructor often 
makes some glaring mistakes in the wording of multiple choice items. 
There is a considerable body of information on the writing of test items 
(see 1, 4, 11). Some of the most important rules will be briefly described. 

1. Make all of the alternatives grammatically consistent with the stem 
(or “question”). For example, with a stem like “The piston ring is,” 
you would not use a plural noun for an alternative, such as “three small 
gears.” Or if the stem ends with the article “a,” you should not include 
alternatives which begin with vowels. There are many other ways in 
which grammatical inconsistencies can “give away” an item. 

2. Do not make the correct alternatives consistently longer or shorter 
than the incorrect alternatives. In the same vein, do not overspecify the 
correct alternative, such as using remarks in parentheses and extra detail 
which will give away the answer. A typical mistake is to be rather casual 
in the formulation of incorrect alternatives and to highly specify the 
correct alternatives. 

3. Make the incorrect alternatives plausible. Sometimes an individual 
gets the correct answer because the other alternatives are obviously not 
related to the subject matter. The rule is that all of the alternatives 
should be equally appealing to the person who does not know the answer, 
but only one alternative should make sense to the person who knows 
the answer. Illustrating this rule with the question, “Which one of the 
following is one of the psychophysical methods?” we would not employ 
an incorrect alternative such as “calculus,” but a plausible, though in- 
correct, alternative such as “triadic inconsistencies.” 

4. Avoid alternatives like “none of the above.” It is difficult to frame 
items in such a way that one of the alternatives is absolutely true in all 
circumstances. The better logic is to construct alternatives in such a way 
that one of them is markedly more correct than the others. The use of 
“none of the above” places a double and somewhat confusing burden 
on the student: he not only must detect the alternative which is most 
correct, he must decide whether or not it is absolutely correct. 

5. Use a random order for the alternatives in each item. This can be 
done with a table of random numbers to be found in many books on 
statistics (see 5). Another approach is to write the alternative positions 
on separate file cards (if there are five alternatives, the letter a would 
be written on one card, the letter b on another and so on to e); the cards 
can then be shuffled and dealt out to allot alternatives to their positions 
beneath the question. Some instructors arrange the alternatives sys- 
tematically throughout the test, insuring that the correct alternative ap- 
pears an equal number of times at each position. Even this method can 
give away answers. Randomization of alternatives is the safest procedure. 
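
The card-shuffling procedure in rule 5 can be imitated with a pseudorandom shuffle. What follows is only a minimal sketch in Python, not a prescribed procedure; the function name `shuffle_alternatives` and the placeholder alternative texts are assumptions introduced for illustration.

```python
import random


def shuffle_alternatives(correct, distractors, rng=None):
    """Place the alternatives in random positions a, b, c, ...

    Returns the lettered alternatives and the letter that ends up
    holding the correct alternative (needed for the scoring key).
    """
    rng = rng or random.Random()
    options = [correct] + list(distractors)
    rng.shuffle(options)  # plays the role of shuffling the file cards
    letters = "abcdefghij"[: len(options)]
    lettered = list(zip(letters, options))
    key = letters[options.index(correct)]
    return lettered, key


if __name__ == "__main__":
    # Placeholder texts; a real item would supply its own alternatives.
    lettered, key = shuffle_alternatives(
        "correct alternative",
        ["distractor 1", "distractor 2", "distractor 3", "distractor 4"],
    )
    for letter, text in lettered:
        print(f"{letter}. {text}")
    print("key:", key)
```

Because each item is shuffled independently, the position of the correct alternative carries no information from one item to the next, which is the point of the rule.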

Review of Items. As in the construction of predictor tests, it is helpful 
to have several persons, or at least one person, review the items which 
go into a classroom examination. This may be too laborious for the whole 
item pool, but someone should at least read and criticize the smaller 
number of items which are used in the test. All of the points mentioned 
earlier in respect to the review of predictor test items apply to assessment 
tests. 

Selection of Items for the Test. Whereas elaborate methods of item 
analysis were recommended for the selection of items for a predictor 
test, the item pool need only be sampled to obtain the items for an assess- 
ment test. If the item pool is representative of the important course con- 
tent, and this rests on the instructor’s judgment, a sample of the items 
will serve as an adequate test. The author selects the items for his class- 
room tests by drawing 40 items at random from an item pool. Any ran- 
dom sample of forty will give much the same results that would be 
obtained from any other random sample and much the same results as 
if the whole pool were made into a test. The sample of items should be 
inspected to make sure that they have all been covered in the course 
work. 

Some people feel uneasy in randomly selecting items from the item 
pool, because they think this method is loose and inexact. But randomness 
is desirable both from the standpoint of sampling logic and to prevent 
the test from being slanted toward any particular feature of the content. 
If the class is told in advance that the items will be randomly drawn 
from a larger pool, they are more content to apply themselves to the 
whole subject matter rather than to try to “dope out” what the test will 
be about. Also, because the instructor will not want to use the same test 
over and over with new classes, a new test can be constructed by a 
second random drawing of items. 

It is sometimes useful to divide the items in a pool into several cate- 
gories and randomly select a number of items from each. This can be 
done if there are definite divisions of the subject matter. However, the 
score obtained on a completely random selection of items will be very 
much the same as obtained by using categories. 
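
Both ways of drawing items described above, the completely random drawing and the drawing by categories, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's own procedure: the function names, the pool of 200 numbered items, and the category labels are all hypothetical.

```python
import random


def draw_items(pool, n_items=40, rng=None):
    """Draw n_items at random, without replacement, from the whole pool."""
    rng = rng or random.Random()
    return rng.sample(list(pool), n_items)


def draw_items_by_category(pool_by_category, n_per_category, rng=None):
    """Draw a fixed number of items at random from each content category."""
    rng = rng or random.Random()
    drawn = []
    for category, items in pool_by_category.items():
        drawn.extend(rng.sample(list(items), n_per_category[category]))
    return drawn


if __name__ == "__main__":
    pool = [f"item {i}" for i in range(1, 201)]  # hypothetical pool of 200 items
    this_term = draw_items(pool, 40)
    next_term = draw_items(pool, 40)  # a second random drawing for a new class
    print(len(this_term), "items drawn;",
          len(set(this_term) & set(next_term)), "shared with the next form")
```

A second call gives a new form for the next class, as the text suggests; the two forms will usually share some items, since each is an independent sample from the same pool.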

Item Analysis. Some information can be obtained about the effective- 
ness of an assessment test by an examination of the scores obtained by 
students. The classroom instructor seldom has the facilities to administer 
the whole item pool to a group of students and must rely, instead, on 
the results from the items in one test. There are several desirable features 
which should appear in multiple choice examinations. One of these is 
that the frequencies of choices for the incorrect alternatives should be 
approximately equal. If practically no one chooses an alternative, it is 
a “give away” and should be replaced with a more plausible incorrect 
alternative. If one of the incorrect alternatives is chosen more frequently 
than the designated “correct” alternative, the instructor may be wrong 
himself, or the question may be misleading. Even on a classroom exam- 
ination there is little reason for retaining an item that no one, or almost 
no one, gets correct. It is more reasonable to retain a few very easy items 
as a way of giving the average student a feeling of accomplishment. 

It is axiomatic that students should receive better grades on an exam- 
ination after the course of instruction than before the course begins. 
Similarly the questions which show the biggest change in number of 
students who get the correct answer from before to after are more par- 
ticularly concerned with the course content. In large school programs, 
such as some of those in the Armed Forces, examinations are sometimes 
constructed on the basis of an improvement in item scores from before 
to after the course. This logic can be carried too far in that an item might 
be very difficult both before and after the course not because of inappro- 
priateness, but because of shortcomings of the instruction. 
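
The checks described in this section amount to simple counting. The sketch below is one hedged way to carry them out in Python for a single item; the function names and the cutoff `min_share` (what counts as "practically no one") are assumptions, since the text gives no numerical rule.

```python
from collections import Counter


def choice_counts(responses, alternatives="abcde"):
    """Count how many students marked each alternative of one item."""
    counts = Counter(responses)
    return {alt: counts.get(alt, 0) for alt in alternatives}


def review_item(responses, key, alternatives="abcde", min_share=0.02):
    """Return the choice counts and a list of warnings for one item.

    min_share is an assumed cutoff for "practically no one chooses it".
    """
    counts = choice_counts(responses, alternatives)
    n = len(responses)
    warnings = []
    for alt in alternatives:
        if alt == key:
            continue
        if counts[alt] > counts[key]:
            warnings.append(f"'{alt}' is chosen more often than the keyed answer '{key}'")
        elif n and counts[alt] / n < min_share:
            warnings.append(f"'{alt}' is practically never chosen")
    return counts, warnings


def item_gain(correct_before, correct_after, n_students):
    """Change in the proportion answering correctly, from before to after the course."""
    return (correct_after - correct_before) / n_students


if __name__ == "__main__":
    # Hypothetical answers of 60 students to one item keyed "a".
    responses = list("a" * 30 + "b" * 12 + "c" * 10 + "d" * 8)
    print(review_item(responses, key="a"))
```

A warning is only a prompt to reread the item; as the text notes, a popular "wrong" alternative may mean the instructor is mistaken, and a small gain may reflect the instruction rather than the item.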


PREPARING TESTS FOR ADMINISTRATION 


After items have been selected from an item pool, either in the con- 
struction of a predictor or an assessment test, some practical decisions 
must be made about how the items will be assembled and administered. 
It must be decided whether or not to order the items in terms of difficulty, 
placing the easier items at the first of the test and the harder ones toward 
the end. It is sometimes desirable to order the items in terms of difficulty 
to prevent students from becoming discouraged because of several diffi- 
cult items which might appear near the beginning of the test. The items 
can be ordered in terms of difficulty only if an empirical tryout has been 
made. Such information is likely to be available for a predictor test but 
not for the usual classroom examination. What an instructor thinks will 
be an easy item may prove to be very difficult for the class and vice 
versa. If difficulty levels have not been obtained from an empirical tryout, 
it is better to randomly order the items throughout the test and inform 
the students about it. If each item has been written on a separate file 
card, the randomization can be performed by shuffling the cards thor- 
oughly and “dealing” them out in the order of their appearance. 
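
The decision just described can be put as a small routine: order by difficulty when tryout data exist, otherwise shuffle. The Python sketch below is illustrative only; the function name and the use of "proportion answering correctly" as the measure of easiness are assumptions.

```python
import random


def order_items(items, proportion_correct=None, rng=None):
    """Order items for the test booklet.

    proportion_correct maps each item to the share of a tryout group that
    answered it correctly; when it is given, the easiest items come first.
    Without tryout data the items are put in random order.
    """
    if proportion_correct is not None:
        return sorted(items, key=lambda item: proportion_correct[item], reverse=True)
    rng = rng or random.Random()
    shuffled = list(items)
    rng.shuffle(shuffled)  # the counterpart of shuffling and dealing the file cards
    return shuffled


if __name__ == "__main__":
    tryout = {"q1": 0.35, "q2": 0.90, "q3": 0.60}  # hypothetical tryout results
    print(order_items(["q1", "q2", "q3"], tryout))
    print(order_items(["q1", "q2", "q3"]))
```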

Test Instructions. The test instructions bear a considerable burden of 
accurately directing subjects to the test content. The failure to make test 
instructions sufficiently clear and detailed is one of the most obvious 
shortcomings of the inexperienced test constructor. The people who deal 
extensively with tests realize that some of the subjects will become con- 
fused by even the most straightforward instructions. It is not uncommon 
for subjects to fail to put their names in the designated place. If the 
subjects are told not to write on the test booklet, some booklets will 
have numerous marks on them. If the items are at all complicated, such 
as asking the subjects to pick the two most correct alternatives instead 
of only one, some of the subjects will inevitably become confused. The 
ease with which subjects become confused with unusual test items, or 
marking procedures, often forces the test constructor to abandon an 
apparently promising new approach. 

The rule is to write test instructions as though all of the subjects were 
of very low intelligence. Test instructions should be written so that the 
dullest and most confused individual will understand them. To do this, 
the instructions should be made redundant and all points should be 
generally overexplained. For example, in a multiple choice test for which 
only one alternative is to be marked, instructions should be written as 
follows: “Mark one of the alternatives for each question. Do not mark 
more than one alternative for each question. Leave none of the questions 
blank.” If special answer sheets are used with the test, at least one, and 
sometimes several, examples should be given. 

Time Limits. If there is a time limit for completing the test, and there 
usually is, the test constructor must determine in advance how much 
time should be given. If the attribute being tested is logically concerned 
with speed, studies should be undertaken to determine the time limit 
which will produce the largest standard deviation of test scores. This can 
be done by trying a number of different time limits with separate groups 
of subjects. The time which works the best can then be made the standard 
time limit. 
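
The comparison of time limits for a speeded test can be sketched as follows. This is a hypothetical illustration in Python: the function name is an assumption and the scores are invented solely to show the computation.

```python
from statistics import stdev


def best_time_limit(scores_by_limit):
    """Pick the time limit whose tryout group shows the largest standard deviation."""
    spreads = {limit: stdev(scores) for limit, scores in scores_by_limit.items()}
    best = max(spreads, key=spreads.get)
    return best, spreads


if __name__ == "__main__":
    # Invented scores from three separate tryout groups (minutes -> scores).
    tryout = {
        5: [8, 9, 10, 11, 12],
        10: [12, 16, 20, 24, 28],
        20: [27, 28, 29, 30, 30],
    }
    best, spreads = best_time_limit(tryout)
    print("chosen limit:", best)
    print(spreads)
```

In the invented data the shortest limit leaves everyone with low scores and the longest lets nearly everyone finish, so both bunch the scores together; the middle limit separates the subjects most.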

There has been a general tendency to make restrictive time limits on 
tests which are not inherently concerned with speed. Some of the pre- 
dictor tests, such as numerical computation and perceptual speed, are 
defined as “speeded” abilities. However, we often find speeded tests of 
reasoning and vocabulary, where there is little ground for restrictive time 
limits. There are few classroom examinations where restrictive time limits 
are justified. Some classroom topics, such as typing and shorthand, are 
naturally concerned with the ability to work quickly. However, there is 
little reason to believe that speed is necessarily involved in the under- 
standing of physics, history, psychology and most other school topics. 
When the test is not intended to measure speed, per se, subjects should 
be given a comfortable time limit. A few students will feel that they do 
not have an adequate amount of time even if they work an hour after 
the rest of the students have left. Some time limit is necessary simply as 
a convenience for the instructor, but most of the students should be able 
to complete the test without feeling pressured by the time limit. 

Auxiliary Materials. Preparations must be made for any special mate- 
rials which are needed for the test. If a particular kind of pencil is re- 
quired, an adequate supply must be on hand. If devices such as rulers, 
drawing equipment, or other instruments are needed, subjects should be 
informed well in advance of the test. If mathematical problems are part 
of the test, the subjects should either be told to bring “scratch” paper 
or it should be provided at the time of testing. Here again, it pays to 
overexplain the test requirements. If the answer sheet must be marked 
in pencil, some of the students will come only with a pen. Some pencil 
points will inevitably break and there must be either extra pencils or a 
sharpener handy. Trivial as these points seem, the test constructor must 
either provide for them or the test reliability will be lowered. 


SUMMARY 


There are two basically different approaches to test construction de- 
pending on whether the instrument is intended to be a predictor or an 
assessment. A predictor test is constructed in terms of statistical strategy, 
the purpose of which is to squeeze the last ounce of predictive efficiency 
out of an item pool. Some of the statistical methods which ideally should 
be applied are too complex and laborious to be used except in special 
cases. Therefore, there is a hierarchy of preferred procedures, depending 
on the time and facilities available. 

An assessment test is constructed in terms of rational rather than statis- 
tical standards. The purpose is to include the important content in an 
examination. Only by a clear statement of the goals of instruction, or in 
a nonclassroom situation, the merits of different behaviors, can an assess- 
ment test be meaningfully reviewed, discussed, and eventually changed 
for the better. In the construction of both predictor and assessment tests 
it is important to write items clearly, standardize the scoring, state the 
instructions carefully, and systematize the administration procedure. 


CHAPTER 9 


The Structure of Human Abilities 


The term “general intelligence” has been used with reluctance in previ- 
ous pages because of two misleading connotations which it bears. The 
first is that the tests which are used to measure “general intelligence” are 
not assessments, as the term “intelligence” suggests, but are predictor 
tests which have gained their meaning from numerous relationships with 
important human variables. The second misleading connotation is that 
the word “general” implies that there is only one kind of intelligence 
rather than separate dimensions of intelligence. 

We usually talk as though human abilities are general. We say, “He is 
a ‘smart’ person,” without bothering to differentiate the ways in which 
the individual is particularly gifted. Sometimes we do make a distinction, 
as in saying, “He is very good in mathematical work, but he is not so 
good in understanding written material.” 

If intelligence is perfectly general, the person who can do one type of 
intellectual problem well can do all other kinds of problems well, and 
the person who does poorly on one type of problem will do poorly on 
all others. Looking at it in terms of the ability to do school work, if 
abilities are perfectly general, the child who learns easily to solve 
arithmetic problems will have the same facility in learning spelling, geog- 
raphy, and other topics. Another possibility is that human abilities are 
completely “specific,” that there are no correlations between different 
intellectual tasks. Then, if we know that a particular child is very adept 
at learning arithmetic, this offers no basis at all for predicting how well 
he will do in spelling or geography. Taking the case further, if abilities 
were completely unrelated, the child who could add numbers very 
quickly might be slower than the average child in performing multipli- 
cations. 

Human abilities are neither completely general nor completely specific. 
The real story lies between these two extremes. In this chapter, we will 
discuss the history of this problem, the mathematical procedures of factor 
analysis, and some of the different abilities which underlie human in- 
tellect. 

Faculty Psychology. The belief was prevalent in the early nineteenth 
century that human behavior is determined by a large number of sepa- 
rate capacities, or faculties, as they were called (see 3, 12). There were 
proposed faculties of attention, memory, reason, will power, esthetic 
appreciation, and many others. It was the early belief that each faculty 
resides in a particular brain location and that “bumps on the head” are 
prognostic of the strength of particular faculties (see 12, p. 56). Although 
the anatomical theories were soon discarded, the belief in a large number 
of specific human faculties lingered throughout the remainder of the 
nineteenth century. 

Binet and “General Intelligence.” Alfred Binet’s early efforts to measure 
intelligence were in line with the faculty psychology of the time. He 
worked at first with the same kind of sensory and motor tests that were 
used by Galton and others. Some of these included suggestibility, size 
of the cranium, tactile discrimination, graphology, and even palmistry. 
Like the others who were trying to measure human abilities, he found 
that these simpler functions do not measure intelligence, as we com- 
monly think of it. 

Binet abandoned sensory and motor tests in favor of a “molar” con- 
ception of intelligence. For practical reasons, Binet thought that it would 
not be possible to measure all of the simple skills that underlie intelligent 
behavior; instead, it would be more feasible to study the end products 
of intellectual functioning. Consequently, his tests emphasized the ability 
to make sound judgments. He defined intelligence as “the tendency to 
take and maintain a definite direction; the capacity to make adaptations 
for the purpose of attaining a desired end; and the power of auto-criti- 
cism” (15, p. 45). 

The Binet-Simon scale provided each child with one score, a mental 
age score (mental age scores were introduced in the 1908 revision). 
Because each child receives only one score, it must be assumed that 
intelligence is general, or that only one factor is involved in the test items. 
If there are, say, two kinds of intelligence involved in the test, the use 
of only one score is somewhat misleading. Two children could make 
the same score, not because they are alike in respect to their abilities but 
because one child is high in one of the factors and low in the other, and 
vice versa for the other child. The same assumption of the generality of 
human abilities is involved in all of the general intelligence tests which 
have followed from Binet’s early work. 


THE GENERAL FACTOR THEORY 


Spearman's G. At the same time that Binet and Simon were working 
in France on the first practical mental test, a psychologist in England, 
Charles Spearman, was working from a different standpoint. Spearman 
hypothesized a general factor of intelligence, which he called the G 
factor. The G factor was thought of at that time as being a kind of 
“mental energy.” Unlike Binet, Spearman undertook extensive investiga- 
tions to determine whether or not there is only one common factor of 
intelligence (see 14). In order to do this, he administered batteries of 
tests to school children, and by statistical analysis of the scores, he tried 
to show that only one common factor underlies all mental tests. In this 
work he introduced the techniques of factor analysis to psychology. 
Earlier, Karl Pearson (13) had made some of the fundamental mathe- 
matical developments. 


Correlation Matrices. If a number of tests have been given to a group 
of people, each of the tests can be correlated with each of the others, 
resulting in what is called a correlation matrix. It is with such correlation 
matrices that factor analysis is undertaken. A hypothetical matrix show- 
ing the correlations between four tests is given in Table 9-1. To read the 
matrix, simply look down a column and across a row to find the intersec- 
tion of two tests. By looking down the arithmetic column and across to 
the reading row, a correlation of .49 is found between arithmetic and 
reading. Note that each correlation appears twice in the matrix. A correla- 
tion of .49 is also found by looking down the reading column and stopping 
at the arithmetic row. The diagonal cells of the matrix have been left 
blank. These are the correlations of the tests with themselves, which are, 
technically, all 1.00, but are allowed to assume different values in the 
actual factor-analytic work. 


TABLE 9-1. INTERCORRELATIONS OF FOUR SCHOOL SUBJECTS 


              Sp.    Ar.    Rd.    Sc. 
Spelling      ( )    .50    .68    .44 
Arithmetic    .50    ( )    .49    .71 
Reading       .68    .49    ( )    .55 
Science       .44    .71    .55    ( ) 
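
To make concrete how such a matrix arises from raw scores, here is a minimal Python sketch that correlates every test with every other. The function names are assumptions, and the scores of six pupils are invented for illustration; they do not reproduce the entries of Table 9-1.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(scores):
    """scores maps each test name to its list of scores (same people, same order)."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Invented scores of six pupils on three tests.
    scores = {
        "spelling": [3, 5, 6, 8, 9, 4],
        "arithmetic": [2, 6, 5, 9, 7, 5],
        "reading": [4, 4, 7, 8, 9, 3],
    }
    matrix = correlation_matrix(scores)
    for row in scores:
        print(row.ljust(11), [round(matrix[row][col], 2) for col in scores])
```

The printed matrix is symmetric, so each correlation appears twice, and the diagonal entries computed this way are all 1.00, as the text notes.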

The Search for G. Spearman’s hypothesis can be more completely 
specified by the statement that mental tests have only one common factor 
and that all else is due to factors specific to each test. That is, part of the 
test variance can be accounted for by one common, or general, factor, 
and otherwise, mental tests have nothing in common. Spearman thought 
that each test would have some particularity, or some specificity, con- 
nected with its items. The concept of “variance explained” and the 
diagrams used to show the overlap of variances (from Chapter 7) can 
be used here to demonstrate factorial composition of tests. The over- 
lapping variances (squared correlations) for one common factor in three 
tests can be pictured as in Figure 9-1. 

In Figure 9-1 the three tests overlap in only one area, the cross-hatched 
section, which is labeled G. Each test then has a portion of its area which 
is specific: it does not overlap with the G area and it does not overlap 
with the other tests outside of the G area. 


FIGURE 9-1. Variance diagram of one common factor (G) in three tests. 


Spearman proved that if there is only one common factor in a set of 
test scores, the columns of coefficients in the correlation matrix will be 
proportional to one another (14). A hypothetical matrix which fulfills 
this requirement is given in Table 9-2. The proportionality of the columns 
in Table 9-2 can be tested by successively dividing the correlations in 
one column by those in any other, for all except the vacant diagonal cells. 
Going down columns A and B in this way, it is seen that the coefficients 
in A are 1.1 times as large as the coefficients in B (.67 over .61, .61 over 
.56, and .55 over .50). The coefficients in any other pair of columns are 
proportional also. 


TABLE 9-2. A HYPOTHETICAL CORRELATION MATRIX SHOWING ONE COMMON 
FACTOR IN FIVE TESTS * 


        A      B      C      D      E 
A      ( )     74     67     61     55 
B       74    ( )     61     56     50 
C       67     61    ( )     50     45 
D       61     56     50    ( )     41 
E       55     50     45     41    ( ) 


* All decimal points have been removed; consequently, 74 should be understood as 
a correlation of .74. 
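
Why one common factor produces proportional columns can be seen from a short computation. If each correlation is the product of the two tests' correlations with the common factor, then dividing one column by another cancels the row test's share and leaves the same ratio in every row. The Python sketch below illustrates this; the loadings are assumed values chosen for the example, not figures given in the text, and the function names are hypothetical.

```python
def one_factor_matrix(loadings):
    """Correlations implied by a single common factor: r_ij = a_i * a_j."""
    tests = list(loadings)
    return {i: {j: loadings[i] * loadings[j] for j in tests if j != i} for i in tests}


def column_ratios(r, col_1, col_2):
    """Ratio of col_1 to col_2 in each row, skipping rows that would use a diagonal cell."""
    rows = [t for t in r if t not in (col_1, col_2)]
    return {t: r[t][col_1] / r[t][col_2] for t in rows}


if __name__ == "__main__":
    # Assumed loadings of five tests on the single common factor.
    loadings = {"A": 0.90, "B": 0.82, "C": 0.74, "D": 0.68, "E": 0.61}
    r = one_factor_matrix(loadings)
    print(column_ratios(r, "A", "B"))  # every row gives 0.90 / 0.82

    # The three pairs of coefficients quoted in the text for columns A and B.
    quoted = [(0.67, 0.61), (0.61, 0.56), (0.55, 0.50)]
    print([round(a / b, 2) for a, b in quoted])
```

With correlations rounded to two places, as in a printed table, the ratios agree only approximately, which is why the quoted coefficients all come out near 1.1 rather than exactly equal.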


Twenty years or more were spent by Spearman and others in England 
trying to determine whether only one common factor, G, underlies all 
mental tests. The tests used in the first stages mainly concerned simple 
sensory discrimination, but these were soon abandoned for the more com- 
plex functions of the type made popular by Binet. Batteries of tests, 
usually containing no more than a dozen instruments, were administered, 
mainly to school children. Correlation matrices were computed and the 
proportionality of columns was studied. Even if there were only one 
common factor among all tests, the actual correlation coefficients would 
not strictly obey the proportionality criterion because of sampling error. 
Consequently, very laborious studies of the departure from proportionality 
had to be undertaken, some of these lasting for months. 

After much investigation, at first in England and later by American 
psychologists, it became apparent that one common factor is not sufficient 
to account for the intercorrelations among mental tests. In a few restricted 
cases, where the tests were much alike, the proportionality criterion was 
approximated. When a broad range of tests was used, however, it was 
apparent that more than one common factor was needed. Spearman won 
his point, in part, in that one common factor can be shown to explain 
much, but by no means all, of the overlap between mental tests. 


MULTIFACTOR THEORY 


Rather than argue over whether one common factor is sufficient to 
explain intellectual behavior, the question soon shifted to that of how 
many factors are required to explain the intercorrelations of tests. The 
original statistical technique for determining a general factor is of little 
use when in fact more than one factor occurs. Procedures were developed 
for discovering how many factors are involved in a battery of tests. These 
statistical procedures are referred to as “multiple-factor analysis.” Pear- 
son (13) provided the original solution to the problem. Cyril Burt (5) 
in England, T. L. Kelley (11) in America, and later L. L. Thurstone (19) 
in America were prominent figures in the development of factor-analytic 
techniques. 

Cluster Analysis. An approximation to factor analysis can often be 
made by a simple inspection of the correlation matrix; consider the matrix 
in Table 9-3, for example. A cluster is defined as a group of tests whose 
members correlate highly among themselves and correlate relatively low 
with tests not in the cluster. 


TABLE 9-3. A HYPOTHETICAL CORRELATION MATRIX FOR SIX TESTS * 


        A      B      C      D      E      F 
A      ( )     01     64     06     72    -13 
B       01    ( )    -07     42     08     56 
C       64    -07    ( )    -02     55    -19 
D       06     42    -02    ( )     11     47 
E       72     08     55     11    ( )    -04 
F      -13     56    -19     47    -04    ( ) 


* All decimal points omitted. 

One way to begin a cluster analysis is first to locate the highest corre- 
lation in the matrix. In this case, it is the correlation of .72 between tests 
E and A. These two tests can form the nucleus of the first cluster. Any 
other test that correlates highly with both A and E can also be placed 
in the cluster. Test C correlates .64 with A and .55 with E, so it can be 
classified in the cluster. None of the other three tests correlates highly 
with A, C, and E; therefore, no other tests belong to the first cluster. 

If there are substantial correlations among the tests which do not go 
into the first cluster, this means that one or more additional clusters 
remain. We then look at the correlations among tests B, D, and F. The 
three intercorrelations are .42, .47, and .56. There are then two clusters, 
or in the same sense, two factors, in these six tests. The two clusters can 
be seen more easily if we rearrange the correlations in Table 9-3 to make 
the first three rows and columns contain the tests for the first cluster and 
the last three rows and columns contain the tests for the second cluster. 
This has been done in Table 9-4. Table 9-4 has been sectioned off to 
show more clearly the intercorrelations of the tests within the two clusters 
and the correlations between the tests in the different clusters. Arranged 
in this way, it is easy to see that there are two different kinds of abilities 
involved in the six tests. 


TABLE 9-4. A REARRANGEMENT OF THE COLUMNS AND ROWS OF TABLE 9-3 


        A      E      C   |    D      B      F 
A      ( )     72     64  |    06     01    -13 
E       72    ( )     55  |    11     08    -04 
C       64     55    ( )  |   -02    -07    -19 
       -------------------+--------------------
D       06     11    -02  |   ( )     42     47 
B       01     08    -07  |    42    ( )     56 
F      -13    -04    -19  |    47     56    ( ) 
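
The inspection procedure described above can be written out as a simple greedy routine. The Python sketch below is one possible reading of it, not a standard algorithm from the text: the function names and the cutoff `high` for "correlates highly" are assumptions. The six within-cluster correlations are the values quoted in the text; their assignment to particular pairs among B, D, and F, and all of the between-cluster values, are assumed for illustration.

```python
def symmetric(pairs):
    """Build a full lookup r[a][b] from one entry per pair of tests."""
    r = {}
    for (a, b), value in pairs.items():
        r.setdefault(a, {})[b] = value
        r.setdefault(b, {})[a] = value
    return r


def find_clusters(r, high=0.40):
    """Seed each cluster with the highest remaining correlation, then add every
    test that correlates at least `high` with all current members."""
    remaining = set(r)
    clusters = []
    while len(remaining) >= 2:
        value, a, b = max((r[a][b], a, b) for a in remaining for b in remaining if a < b)
        if value < high:
            break  # nothing substantial is left to cluster
        cluster = [a, b]
        for test in sorted(remaining - {a, b}):
            if all(r[test][member] >= high for member in cluster):
                cluster.append(test)
        clusters.append(sorted(cluster))
        remaining -= set(cluster)
    return clusters


PAIRS = {
    # within-cluster correlations quoted in the text
    ("A", "E"): 0.72, ("A", "C"): 0.64, ("C", "E"): 0.55,
    ("B", "D"): 0.42, ("D", "F"): 0.47, ("B", "F"): 0.56,
    # between-cluster correlations: small values assumed for illustration
    ("A", "B"): 0.01, ("A", "D"): 0.06, ("A", "F"): -0.13,
    ("B", "C"): -0.07, ("B", "E"): 0.08, ("C", "D"): -0.02,
    ("C", "F"): -0.19, ("D", "E"): 0.11, ("E", "F"): -0.04,
}

if __name__ == "__main__":
    print(find_clusters(symmetric(PAIRS)))
```

With these values the routine seeds the first cluster with A and E, adds C, and then finds B, D, and F among the remaining tests, matching the two clusters identified by inspection.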

If the clusters were generally as clear-cut and as easy to locate as they 
are in the example above, there would be little need for more formal 
procedures of analysis. When the number of tests is as large as that con- 
tained in present-day factor analyses, often more than fifty, it becomes 
very difficult to locate clusters. Also, it will almost never be the case 
that tests will divide themselves neatly into clusters such that each test 
belongs unequivocally to only one cluster. More often it is difficult to 
decide whether a test should be placed in one cluster or another, and 
the tests in one cluster usually have at least moderate correlations with 
the tests in other clusters. 

The Vector Model. A model which is more general than variance dia- 
grams is required to demonstrate how multiple-factor analysis works. 
Correlations can be portrayed by lines, or vectors, and by the angles be- 
tween vectors. In Figure 9-2 three correlations have been pictured with 
vectors. In the vector model, each test is represented by one line. The 
closer the lines are to one another, the higher the two tests correlate. In 
the diagram on the left in Figure 9-2, the vectors for tests 1 and 2 are 
at right angles, that is, 90 degrees apart. This always means a zero 
correlation between two tests. In the middle diagram, the vectors are 
less than 90 degrees apart, and consequently, there is a correlation greater 
than zero, .40, in this case. The diagram on the right shows an even 
smaller angle between vectors, portraying a correlation of .64. 


FIGURE 9-2. Three correlation coefficients represented by pairs of vectors 
(r = .00, r = .40, r = .64). 


The vectors are not all of the same length in the vector model. The 
length of the test vector depends on the over-all amount of correlation 
the test has with the others in the matrix, the “communality.” The corre- 
lation between two tests in the model depends both on how large the 
angle is between two vectors and the length of the two vectors. A fuller 
explanation of how this works would take us far beyond the mathematical 
complexity which can be permitted here (see 7, 16, 19). However, the 
meaning of the model can be grasped if it will be kept in mind that long 
vectors and small angles mean high correlations. The shorter the vectors 
and the nearer the angle approaches 90 degrees, the less the correlation. 
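
The text does not give a formula, but the usual formalization of the vector model treats the correlation between two tests as the scalar product of their vectors, that is, the product of the two lengths and the cosine of the angle between them. The Python sketch below works under that assumption; the function names are hypothetical.

```python
from math import acos, cos, degrees, radians


def correlation_from_vectors(length_1, length_2, angle_degrees):
    """Scalar product of two test vectors: length * length * cosine of the angle."""
    return length_1 * length_2 * cos(radians(angle_degrees))


def angle_for_correlation(r, length_1=1.0, length_2=1.0):
    """Angle in degrees that yields correlation r for vectors of the given lengths."""
    return degrees(acos(r / (length_1 * length_2)))


if __name__ == "__main__":
    for r in (0.00, 0.40, 0.64):
        print(r, round(angle_for_correlation(r), 1))
    # Shorter vectors give a lower correlation at the same angle.
    print(correlation_from_vectors(1.0, 1.0, 30), correlation_from_vectors(0.5, 0.5, 30))
```

For vectors of unit length, correlations of .00, .40, and .64 correspond to angles of 90 degrees, about 66 degrees, and about 50 degrees; shortening either vector lowers the correlation even when the angle stays the same.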

All of the intercorrelations in Table 9-4 can be cast in the form of one 
vector model, which is shown in Figure 9-3. The vector model demon- 
strates the two clusters noted earlier. Tests A, C, and E cluster together 
in the space, and similarly, tests B, D, and F are near one another. The 
tests in one cluster are separated in general about ninety degrees from 
the tests in the other cluster, showing the low correlations between the 
tests in the different clusters. 


FIGURE 9-3. Vector model of the correlation coefficients shown in Table 9-4. 


Factor Loadings. What factor analysis does is to place new lines, or 
vectors, in the vector model. The number of new vectors is generally 


Figure 9-4. Vector model with factor axes.


equal to the number of major clusters among the tests. Figure 9-4 shows 
how a typical factor analysis would locate two new vectors, FI and FII, in
the vector model. The factor vectors go out in both directions from the 
origin, having both a plus and minus side. Describing the mathematical 
techniques which are required to place the factor vectors in the vector 
model would again take us into complexities which cannot be treated 
here. (The interested student can learn more about factor-analytic pro-
cedures from 7, 16, 19.) However, it is not necessary for the reader
completely to understand the mathematics of factor analysis to compre- 
hend what the factor vectors mean once they have been located in the 
vector model. 
Each of the factor vectors reads like a yardstick, with the numbers on 
the scale showing the amount of correlation with the tests. The two factor
vectors constitute a coordinate system, much like the axes on a piece of 
graph paper. The correlations of the tests with the factors are referred 
to as loadings. Each test has a loading on each of the two factors. Read- 
ing from the graph, test A has a correlation, or loading, of .90 on Factor 
I and a loading of -.05 on Factor II. The full set of factor loadings,
called a factor matrix, is given in Table 9-5. 


TABLE 9-5. Factor Matrix for Six Tests


Test        Loading

            FI       FII
A           .90     -.05
E           .80      .05
C           .70     -.15
D           .10      .60
B           .05      .70
F          -.10      .80
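
A factor matrix of this kind can be used to work back toward the correlations among the tests. The sketch below assumes the standard result for factors placed 90 degrees apart, which is not derived in the text: the correlation between two tests is approximated by the sum of the products of their loadings, and a test's communality is the sum of its squared loadings. The loadings are those of Table 9-5; the function names are illustrative.

```python
# Loadings from Table 9-5: (loading on Factor I, loading on Factor II).
LOADINGS = {
    "A": (0.90, -0.05),
    "E": (0.80, 0.05),
    "C": (0.70, -0.15),
    "D": (0.10, 0.60),
    "B": (0.05, 0.70),
    "F": (-0.10, 0.80),
}

def reproduced_correlation(test_x, test_y):
    """Sum of products of loadings (assumes uncorrelated factors)."""
    return sum(a * b for a, b in zip(LOADINGS[test_x], LOADINGS[test_y]))

def communality(test):
    """Sum of squared loadings: the share of a test's variance tied to the factors."""
    return sum(a * a for a in LOADINGS[test])

# Two tests from the same cluster versus two tests from different clusters.
print(round(reproduced_correlation("A", "E"), 4))
print(round(reproduced_correlation("A", "D"), 4))
print(round(communality("A"), 4))
```

Tests loading on the same factor reproduce a high correlation, while tests from different clusters reproduce a correlation near zero, which is the pattern the two clusters in the vector model portray.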


The Results of a Factor Analysis. A factor analysis can serve to sim- 
plify the original battery of test scores and the intercorrelations among 
the tests. The factor matrix in Table 9-5 shows that there are only two
general kinds of abilities involved in the tests instead of six. At this point 
we do not know what these two underlying abilities are. Let us say that 
tests A, E, and C are concerned respectively with reading ability, com- 
prehension of written material, and vocabulary. We could then interpret 
Factor I as having to do with verbal ability, with the understanding of 
words. Say that tests D, B, and F concern addition, multiplication, and
mathematical reasoning respectively. The second factor is concerned with 
numerical computation, the ability to perform arithmetic operations. 

After the factors are interpreted and named, a score can be given to 
each person on each of the factors. Although a precise determination of 
such scores requires some complicated calculations (see 16, chap. 15), a
good approximation could be obtained in the present example by
averaging the individual's scores on the three tests which have high load-
ings on a factor. That is, if an individual's scores on tests A, E, and C
are averaged, this provides a good approximation of the score which he
has on Factor I. Similarly, scores on Factor II could be obtained by
averaging the scores on tests D, B, and F.
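
The averaging approximation can be sketched as follows. The sketch assumes that the six test scores have first been put on a common scale (standard scores), which the text does not state; the person's scores are hypothetical and the names are illustrative.

```python
from statistics import mean

# Tests with high loadings on each factor, as read from Table 9-5.
FACTOR_TESTS = {"I": ("A", "E", "C"), "II": ("D", "B", "F")}

def approximate_factor_scores(standard_scores):
    """Average the scores on the tests that load highly on each factor."""
    return {
        factor: mean(standard_scores[test] for test in tests)
        for factor, tests in FACTOR_TESTS.items()
    }

# Hypothetical standard scores for one person on the six tests.
person = {"A": 1.2, "E": 0.8, "C": 1.0, "D": -0.5, "B": 0.1, "F": -0.2}
scores = approximate_factor_scores(person)
print({factor: round(value, 2) for factor, value in scores.items()})
```

This is only the rough approximation described above, not the precise determination of factor scores referred to in the text.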

The Complete Model. The simplified example portrayed in Figures
9-3 and 9-4 requires only two dimensions, which is another way of saying
that the vectors lie completely on the flat surface of a page. The angles
between the vectors might be such that they could not be pictured on a
flat surface, and at least a three-dimensional model would be required.


In practical work the test vectors will seldom lie on a flat surface, and 
even three dimensions are often not sufficient for portraying the model.
Of course, when the model requires more than three dimensions, there 
is no physical way of portraying the vectors. However, this does not 
hinder the mathematical derivation of four, five, and even ten dimensions, 
or factors, to account for the correlations among tests. 

The ultimate aim in performing factor-analytic studies is to reduce the 
nearly infinite number of tests which can be composed to a smaller num- 
ber of basic abilities, or factors. This not only achieves a real parsimony 
in practical work; it helps us to understand the nature of human abilities. 
Factor analysis is most useful in the early stages of research, when not 
much is known about an area of investigation. It is a very helpful way 
of “mapping” the territory before proceeding to a more detailed study 
of particular attributes. 

Correlations among Factors. In Figure 9-4 the two factor vectors are 
placed 90 degrees apart, which signifies that there is a zero correlation 
between the two factors. However, in many factor analyses the factor 
vectors themselves are allowed to correlate, or in other words to assume 
angles different from 90 degrees. If the correlation between any two 
factors is allowed to become too large, then the factors begin to lose 
their value as “different” kinds of abilities. It is not uncommon to find 
factor solutions in which the correlations between particular factors are 
as high as .50. Considering that the factors themselves are correlated, 
this means that factors represent separable but not completely independ-
ent abilities.
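
What a correlation between factors implies can be sketched in two steps. The sketch assumes, first, that the correlation between two factor vectors is the cosine of the angle between them, and second, the standard expression for the correlation reproduced by two correlated factors; neither is given in the text. The loadings are taken from Table 9-5 purely for illustration and are treated as if they still applied once the factors are allowed to correlate.

```python
import math

def factor_angle_degrees(factor_correlation):
    """Angle between two factor vectors that correlate by the given amount."""
    return math.degrees(math.acos(factor_correlation))

def reproduced_correlation(load_x, load_y, phi):
    """Correlation between two tests given loadings on two factors correlating phi."""
    (a1, a2), (b1, b2) = load_x, load_y
    return a1 * b1 + a2 * b2 + phi * (a1 * b2 + a2 * b1)

# Uncorrelated factors sit 90 degrees apart; a correlation of .50 narrows the angle.
print(round(factor_angle_degrees(0.0), 1))
print(round(factor_angle_degrees(0.5), 1))

# Tests A and D (different clusters) with uncorrelated and with correlated factors.
test_a, test_d = (0.90, -0.05), (0.10, 0.60)
print(round(reproduced_correlation(test_a, test_d, 0.0), 3))
print(round(reproduced_correlation(test_a, test_d, 0.5), 3))
```

With correlated factors, tests from different clusters no longer reproduce a near-zero correlation, which is one way of seeing why the factors are then separable but not independent.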

Factor analysis can be used in many different kinds of problems, but 
our primary concern will be with its application to psychological tests. 
Factor analyses have been made of mechanical abilities, interest tests, 
and personality tests, as well as of the more intellectual functions. We 
will look at the factor-analytic results in the domain of tests of the in- 
tellectual type in this chapter and later refer to factor-analytic findings 
that relate to other kinds of human attributes. 


FACTORS OF ABILITY 


The large amount of factor-analytic work performed during the last
fifty years permits us to speak with confidence about at least some of the 
factors which underlie tests of the ability type. One cardinal finding is 
that the correlations among human abilities are almost always positive,
even if small in some cases. Consequently, it would be rare to find one
human ability for which high performance indicates that the individual
will do poorly in another type of performance. Because of the tendency
for all tests of ability to correlate positively with one another, the factors
which have been found also tend to correlate positively with one an-
other. This is a partial confirmation of Spearman's G. Spearman was both
correct and incorrect: there is a moderate tendency for the individual 
to do as well or as poorly on nearly all ability tests, but there are also 
definite clusterings of tests into separable factors. 

How Many Factors? Factor analysts have differed not so much in the 
kinds of factors they have found, but in how many factors they have 
derived. The number of factors used to explain ability tests is dependent 
on how finely one wants to attack the problem. One could be content 
with only the general factor reflected in the tendency for tests to corre-
late positively with one another. Or, the factor analyst can stop with
the several most prominent factors, those which explain most of the
test variance. Most of the British factor analysts have been content
with exploring only the major factors. At the other extreme, one can 
continuously search for more and more factors of less and less impor- 
tance. 

An analogy may help in portraying the question of how many factors
should be derived. In constructing a topographic map, the cartographer 
has a choice as to how finely differences in elevation will be shown in the 
contour lines. If contour lines are used to show differences in elevation 
of 1,000 feet, a mountain might appear as one smooth peak. If contour 
lines were used to show differences in elevation of 200 feet, the increased 
detail might show that there are actually several peaks on the same base. 
Increasing the fineness of the contours to show differences in elevation of 
10 feet would bring out considerably more detail about the terrain. One 
can imagine a logical extension to the use of contours to reflect inches 
of elevation, and then every stone would be visible. The cartographer 
must strike a balance between two opposing standards. The more grossly 
the contours reflect differences in elevation, the easier the terrain is to 
survey and to picture as a map. The more finely the contours reflect 
elevation, the more nearly the map approaches the actual terrain; but, 
after a point, the map becomes difficult to interpret. 

The factor analyst faces much the same problem as the cartographer: 
he must decide at what point a continued derivation of highly specialized 
factors no longer adds to the knowledge of human abilities. Most of the 
work in factor analysis has concentrated on the derivation of new factors 
rather than on exploring the usefulness of the factors which are derived. 
Although the major factors have already proved useful as predictor tests, 
it remains to be seen whether or not some of the factors will add to the
predictiveness of test batteries. When more research is done on the 
meaning and usefulness of factors, we will be in a better position to say 
which factors are important human attributes and which are merely 
artifacts of psychological tests. 

A Factor-analysis Problem. To illustrate the steps that are required in 
a factor analysis, the research reported by the Thurstones in 1941 (20) 
will be described. In an earlier analysis of tests given to college students, 
Thurstone and his colleagues had found nine factors (17). The purpose 
of the second analysis was to see if the same factors would be found in
children. Sixty tests were constructed for this purpose, a considerable
amount of test construction. Tests which had produced particular factors
in previous analyses were included along with new tests, which it was
hoped would point to undiscovered factors. In order for important factors
to appear, it was necessary for the tests to range as broadly as possible
across the ability functions. The tests varied from familiar forms like 
arithmetic and vocabulary to the solving of picture puzzles and the 
counting of dots on paper. 

Each test was separately standardized on a tryout group of children. 
Then test forms were reproduced and arrangements were made for the 
administration of materials. All 60 tests were given to 710 eighth-grade
students in the Chicago schools. The tests were given, a half-dozen at a 
session, over a period of 10 school days. 

All intercorrelations were computed for a 63 by 63 matrix. This in- 
cluded intercorrelations of the 60 tests plus the three additional variables 
of age, sex, and mental age (obtained from previous testing). This re-
quired the computation of 1,953 correlation coefficients. Elaborate factor-
analysis procedures were then applied to the correlation matrix. In all, it
was a very large research venture, as all such substantial factor-analytic 
studies tend to be. 
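
The figure of 1,953 coefficients follows from counting the distinct pairs among 63 variables, n(n - 1)/2. A minimal check, with an illustrative function name:

```python
def number_of_correlations(n_variables):
    """Number of distinct pairs among n variables: n(n - 1) / 2."""
    return n_variables * (n_variables - 1) // 2

print(number_of_correlations(63))  # 1953, the count reported for the 63 by 63 matrix
print(number_of_correlations(60))  # 1770, for the 60 tests alone
```

Adding only three variables to the 60 tests adds 183 coefficients, which suggests how quickly the computing burden grows with the size of the battery.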

The Thurstones found seven definable factors in the study of children. 
Previous research in England and America and research since the Thur- 
stones’ study have generally supported the seven factors and added a 
number of others. The major factors which have been found are described 
in the following section. 




VERBAL FACTORS 


(Vc) Verbal Comprehension. This factor has been found in many
different factor analyses (cf. 4, 6). It concerns the ability to understand
words and written material. Some tests which have high loadings on Vc
are vocabulary, sentence completion, and reading comprehension.

Typical items:


1. Which one of the following words means most nearly the same as
salutation? 
a. offering 
b. greeting 
c. discussion 
d. appeasement 
2. Which one of the following words is most nearly the opposite of 
languid? 
a. unemotional 
b. sad 
c. energetic 
d. healthy 


(Vf) Word Fluency. This is another verbal factor that has been found
in numerous studies (see 4, 6). It is concerned with the production of
words, in contrast to Vc, which is concerned with the understanding
of words. Any test which requires the subject to produce words rapidly
has some loading on Vf. Although there is a moderate correlation between
Vc and Vf, a correlation of .42 in the study described above (20), the
two abilities are well distinguished. It is possible for an individual to be
able to produce words rapidly with little understanding of the meaning of
words. The factor Vc tends to be prominent when there is a range of diffi-
culty in words. Factor Vf becomes more important when the words are
simple and the concern is with how rapidly the words are produced. The
contrast between Vc and Vf fits in well with the observation that there are
individuals who have difficulty in expressing their own ideas but are 
quite good at understanding written material. 

Typical items: 


1. Write as many names of foods as you can in the next two minutes. 


2. In each of the following rows write three words that mean almost 
the same as the given word. 
small 
helpful 
kind 
(Note that the words here are simple and that the subject is
required to produce the new terms.)


NUMBER FACTORS


(N) Numerical Computation. One very clear numerical factor has 
been found in different analyses (see 4, 6). It concerns the speed of solv-
ing arithmetic problems of all kinds: addition, subtraction, multiplication,
and division problems. The factor is not present in all tests that employ 
numbers but only in those tests where actual computations are required. 
Some of the perceptual factors and reasoning factors to be discussed 
employ numbers but have little of the factor N. 

Typical items: 


  246       8754
 +943      - 381


10X = s 284/4 


REASONING FACTORS 


Reasoning is a complex domain of attributes. Some of the factors have 
been found in a number of investigations and others are still in dispute 
(see 6, 8, 9).

(Rg) General Reasoning. The most common, and most commonly
found, factor of reasoning is concerned with the ability to invent solutions
to problems. Arithmetical reasoning problems are most characteristic
of Rg.

Typical items: 


1. If two men can build one house in twelve weeks, how many 
houses can twelve men build in two weeks? 

2. How would you get exactly 7 quarts of water from a stream if you 
had one 5-quart container and one 3-quart container? 


(Rd) Deduction. This factor is concerned with the drawing of conclu-
sions, as in logical syllogisms. In this type of reasoning there is nothing
in particular to be discovered or invented, the ability being concerned
with evaluating the implications of an argument (see 8, 9).

Typical items: 


1. John is younger than Fred, 
Bill is older than Fred; therefore, Bill is ______ than John.
2. A student has 10 marbles. No one else in his class has 10 marbles. 
This means that 
a. no one else in the class has marbles.
b. all of the other students have less than 10 marbles. 
c. some of the students have less than 10 marbles. 
d. some of the students have more than 10 marbles. 
e. only one student has exactly 10 marbles. 

(Re) Eduction of Relationships. A third factor of reasoning has
appeared often enough to be cited here, but it is not as firmly established
as Rg and Rd (see 8, 9). The factor involves the ability to see the relation-
ship between two things or ideas and to use the relationship to find other
things or ideas. The factor is best represented by verbal analogies and 
design analogies. 


Typical item: 


Ship is to sail, as automobile is to
a. ship
b. seat
c. motor
d. wind
e. driver


MEMORY FACTORS 


The domain of memory factors is complex and only partially explored. 
A recent factor analysis by Kelley (10) has helped considerably to clarify 
the area. 

(Mr) Rote Memory. The best-established factor of memory concerns
the ability to remember simple associations where meaning is of little or 
no importance (see 4, 6, 10). 

Typical item: 


The subject is given a list of names, each of which is paired with a 
number. He is given a minute or so to memorize which number goes 
with which name. Then he is told to turn the page. The next page 
contains the list of names without the numbers. The subject is
instructed to write the proper numbers next to the names. 


Other items which concern Mr are of the same general kind as the one
shown above. They would be such as pairing colors with words, initials
with last names, and photographs with geometrical forms.

(Mm) Meaningful Memory. There is substantial evidence to indicate
that there is a factor involved in the retention of meaningful relationships
which is separable from rote memory (see 10). Factor Mm appears when
the subject is requested to memorize sentences, meaningfully related
words, and lines of poetry.

Typical items: 


1. The subject is asked to read and try to remember a list of sentences
like the following:
John repaired the wagon by welding the broken axle.
The list of sentences is taken away and the subject is then given
the same sentences with one or more of the words deleted from
each, like the following:
John repaired the wagon by welding the broken ______.
2. The subject is shown a list of meaningfully related pairs of words 
such as the following: 


dog bark 
shoe leather 
hard candy 
small box 


The list is taken away and the subject is presented with only one 
member of each pair as follows: 

dog 

shoe 

hard 

small 


There is evidence for several other memory factors. A number of inves- 
tigators have reported a memory span factor concerning the ability to 
recall perfectly for immediate reproduction a series of unrelated items 
(cf. 6, 10). A typical item would be to read a series of five to a dozen 
numbers and ask the subject to give the numbers back in their exact order. 
There is also some evidence for a visual memory factor in which the abil- 
ity to grasp the relationships within a picture or pattern is important (see 
6, 10). A typical item would consist of showing an individual a landscape 
picture and asking him to remember the details. Then the picture would
be taken away, and the subject would be presented questions like “How 
many sheep were in the picture?” “What was the boy handing to the 
man?” “Where was the swing located?” The visual memory factor might
be related to the ability to remember faces, license numbers, and wit-
nessed events.


SPATIAL FACTORS 


(So) Spatial Orientation. This factor concerns the ability to detect 
accurately the spatial arrangements of objects in respect to one’s own body 
(see 6, 18). The factor would be necessary in deciphering pictures taken 
from a maneuvering airplane. If the plane is simultaneously turning and 
climbing, the landscape looks very different from the normal view that 
we get. The individual who can accurately detect what maneuver the 
airplane is going through from looking at only a picture of the landscape 
from that vantage point is high in spatial orientation. The factor appears 
most prominently when the spatial problems are presented under 
“speeded” conditions.

(Sv) Spatial Visualization. This second spatial factor differs in a subtle
manner from So. It is present when the subject is required to imagine,
or “visualize,” how an object would look if its spatial position were
changed. Although there is good statistical support for both of these fac-
tors, there has been considerable difficulty in understanding the under-
lying processes (see 6, 9, 18). Spatial orientation seems to require either
an actual or imagined adjustment of one’s own body. In spatial visualiza-
tion the observer cannot solve the problem by a bodily adjustment;
instead, he must conceive of how an object would look if its spatial position 
were markedly changed. In contrast to spatial orientation, spatial visuali- 
zation is best tested under relatively “unspeeded” conditions. 

Typical item:


The subject is shown a folded piece of paper with a number of holes 
punched in it. He is asked to choose from a number of alternatives
how the paper would look if unfolded.


PERCEPTUAL FACTORS 


A number of factors have been found which concern the ability to
detect accurately visual patterns and to see relationships within and 
among patterns. Some of these factors are apparently of only limited 
importance, such as the ability to judge certain types of illusions. Several 
of the more important factors will be described.

(Ps) Perceptual Speed. This factor is concerned with the rapid recogni-
tion of perceptual details and particularly with the similarities and dif-
ferences between visual patterns (see 9, 18).

Typical items: 


1. The subject is shown a geometrical form and asked to choose from 
a number of other forms the one that is the same as the first. 

2. The subject is told to make a check mark by each pair of letter 
groupings if they are identical and to make no mark if they are 
different. 

x’ #.1q      X’#.1Q
a&30(k       a&3(oK
_ro-/w       _ro-/w
(Fifty to several hundred pairs would be used, depending on
the time allowed.)


(Pc) Perceptual Closure. This factor concerns the perception of objects
from limited cues (see 9, 18). The word “closure” means a sudden
awareness of an obscure object or relationship. Perceptual speed requires
only the recognition of a perceptual form. Perceptual closure requires the
“putting together” of a perceptual form when only part of it is presented.
(See Figure 9-5.)

There is some evidence for the existence of a flexibility of closure factor 
in problems which require the subject to detect one perceptual pattern
which is imbedded in a distracting or competing pattern (see 9, 18). This
factor is found in such items as the hidden-picture games that are printed 


Figure 9-5. Items pertaining to the Perceptual Closure Factor. The configuration in
each frame is projected briefly on a screen. The subject must determine the hidden 
word, letter, or number. (Adapted from Thurstone, 18, pp. 13, 21.) 


in newspapers and some children’s magazines. At first glance, the picture
looks like a normal landscape, but after careful scrutiny, a number of
faces can be found hidden in the trees and rocks. In order to see the 
faces, it is necessary to resist or “break down” the perception of the 
object in which the faces are imbedded. 


SPEED 


The question of whether a test should be “speeded” or not has come up 
a number of times in previous chapters. Nearly all tests have a time limit 
for practical purposes, if for no other reason. In a vocabulary test, for 
example, there is no intention of hurrying the subject, and the time limit 
is usually liberal enough to satisfy most individuals. Some abilities are 
intimately connected with speed, such as typing and shorthand speed; and 
also some of the factors which have been discussed, such as word fluency 
and perceptual speed, can only be measured by the use of “speeded” tests. 

Considerable interest has been shown over the last fifty years in deter- 
mining whether or not speed and accuracy are the same in psychological 
tests. Although the findings on this topic have not been consistent and the 
answer is not the same with all tasks, the general conclusion is that speed 
goes along with accuracy to an extent, but that these two attributes are far 
from perfectly correlated. A comparison of scores given under “speeded” 
and “nonspeeded” conditions shows that at least one and perhaps a num-
ber of speed factors can be isolated (see 4, 6). It is not known at this
time just how speed factors interact with the other ability factors.


MULTIFACTOR TEST BATTERIES 


From this point on, the book will take a practical slant with the exami-
nation of typical tests and measurement methods in use in psychology. No
effort will be made to give an exhaustive list of current tests. There are
thousands of commercial tests available, and any effort to describe a
sizable fraction of these would provide very dull reading. Tests come and 
go, and better ones will always be forthcoming as our knowledge of test- 
ing increases. Rather than make this a catalogue of tests, the book is 
intended to impart a logic for measurement methods which can be applied 
to tests in general. A list of test publishers is presented in Appendix 1. The 
individual who uses tests in professional work will want to obtain test 
catalogues and in this way keep up with new tests as they are developed. 

In the sections to follow, several tests will be described to illustrate the 
measurement of each type of human attribute. The usual procedure will 
be first to discuss one of the older tests in each area to provide historical 
perspective; then, one or several of the leading tests in the area will be 
discussed. 

Judging a Multifactor Battery. Differential aptitude tests can be con- 
structed either rationally or from the results of a factor analysis. Preferably 
the battery should come from a factor analysis. Otherwise there is no 
guarantee that the tests will measure different attributes or that the tests 
will cover a wide range of attributes. If the battery is derived from factor 
analysis, each of the tests should have high “factorial validity.” A test has 
high “factorial validity” if it correlates highly (has a high loading) with 
the factor it is intended to measure. 

The statistical necessities of efficient differential prediction as described 
in Chapter 7 apply to all differential aptitude, or multifactor, batteries. 
Reiterating, each of the subtests should have high individual reliability 
and should correlate as low as possible with one another. On the practical 
side, differential aptitude batteries should be judged in terms of the 
amount of research which has been done on the instrument, the repre- 
sentativeness of the norms, the appropriateness of the battery for different 
groups of persons, and the time and expense of using the tests. 

The Primary Mental Abilities Test (PMA). The PMA tests were devel- 
oped directly out of the factor-analytic work of Thurstone and his col- 
leagues, one phase of which was described previously. This was the first 
comprehensive multifactor battery and marked a milestone in psycho- 
logical measurement. The first regular edition appeared in 1941. It was 
to represent each of six factors. The following factors were represented:
(1) rote memory, (2) verbal comprehension, (3) word fluency, (4) nu-
merical computation, (5) spatial orientation, and (6) general reasoning.
Science Research Associates (SRA) later assumed publication of the
PMA series and issued forms for the 5-to-7 and 7-to-11, as well as for the
11-to-17-year levels. The batteries have been considerably shortened, with
only one short test being used to measure each factor. Some of the factors
are not represented at one or more of the age levels: for example, the rote
memory factor is dropped from the 11-to-17-year battery. Figure 9-6
shows typical items which appear on current forms of the PMA.

Evaluation of the PMA. In spite of the very important research which
preceded the PMA, the current tests fall short of the standards of an
efficient multifactor battery. The fact that only one short test is used to


Figure 9-6. Sample Items from the SRA Primary Mental Abilities Tests for Ages 11
to 17. (Reproduced by permission of Thelma G. Thurstone and Science Research
Associates.)


Verbal meaning is tested by multiple choice vocabulary items like the following: 
Safe: A. secure B. loyal C. passive D. young
The word which means most nearly the same as the first word is to be marked. 


Space is tested by items like the following: 


[Figure: a sample figure followed by lettered alternatives (a) through (f).]
Every figure that could be obtained by a rotation of the first figure is to be marked. 


Reasoning is tested by items like the following: 


[Figure: a letter series followed by answer choices.]


The letters form a series based on a rule. The subject is to discover the rule and
mark the next letter in the series.


Number is tested by items like the following: 
[Figure: sample arithmetic item.]


Word fluency is tested by requiring the subject to write as many words beginning
with a certain letter as possible during five minutes.
measure each of the factors has two unfortunate consequences. First,
to have high factorial validity, it is usually necessary to use several tests. 
By averaging scores, much of the specificity of each test is canceled and 
more of the underlying common factor is tapped. Secondly, either aver- 
aging the scores on several tests for each factor or making one longer test 
for each factor raises the reliability. 
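
The gain in reliability from lengthening a test or averaging several tests can be sketched with the Spearman-Brown formula. That formula is not named in this passage and is brought in here as an assumption; it also assumes that the added tests are parallel to the original one. The starting reliability of .72 is borrowed from the word fluency figure cited below purely as an illustration.

```python
def spearman_brown(reliability, k):
    """Projected reliability when a test is lengthened k times with parallel material."""
    return k * reliability / (1 + (k - 1) * reliability)

# Projected reliability as one short test is replaced by 2, 3, or 4 comparable tests.
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(0.72, k), 3))
```

Under these assumptions the projected reliability rises steadily with the number of tests, though each added test contributes less than the one before.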

In some forms of the SRA PMA, reliability statistics have been either 
incorrectly computed or inadequately reported. Split-half reliability esti- 
mates have been made without regard to the highly speeded nature of 
some of the tests. In a restudy of the PMA reliability, Anastasi (1, p. 366) 
found that when properly computed some of the reliabilities fell mark- 
edly. For example, the word fluency test reliability dropped from a 
reported .90 to an actual .72, and the spatial relations test from .96 to .75. 
These reliabilities are too low for an adequate multifactor battery. 

Compounded with the insufficient reliability of some of the PMA sub- 
tests, there are moderately high correlations among them. For the 
seven-to-eleven-year form, intercorrelations were found to range from .41
to .70; and for the five-to-seven-year battery, intercorrelations ranged from
.51 to .73. The combination of relatively low reliabilities on some subtests
and generally high intercorrelations of subtests considerably weakens the 
PMA as a multifactor battery. 
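
Why this combination is damaging can be sketched with the standard formula for the reliability of a difference between two scores, which is what differential prediction rests on. The formula is an assumption brought in here, not taken from this passage, and it assumes that the two subtests have equal variances. Pairing the reliabilities of .72 and .75 with intercorrelations of .41 and .70 is illustrative only, since those figures were reported for different forms.

```python
def difference_score_reliability(rel_x, rel_y, r_xy):
    """Reliability of the difference X - Y for two equally variable subtests."""
    return ((rel_x + rel_y) / 2 - r_xy) / (1 - r_xy)

# The same two reliabilities at the low and high ends of the reported intercorrelations.
print(round(difference_score_reliability(0.72, 0.75, 0.41), 3))
print(round(difference_score_reliability(0.72, 0.75, 0.70), 3))
```

As the intercorrelation approaches the reliabilities of the subtests, the difference between two subtest scores carries little besides error of measurement.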

Although the PMA test manuals make claims for the specific vocational 
and educational implications of scores, only a small amount of actual re- 
search has been done with the battery. Although norms are given for the 
battery, very little information has been obtained about the nature of the 
populations tested. No information is given on sex differences even for 
the subtests on which significant sex differences are usually found. 

The PMA tests have not kept up with the increasing knowledge about 
ability factors. For example, the reasoning test employed in the 11-to-17- 
year battery, the series completion type of test, has since been found to 
be a mixture of factors and is not the best test to measure general reason- 
ing. Also, a more modern multifactor battery should include several kinds 
of reasoning tests in line with more recent factor-analytic findings. Sub- 
sequent research might show that the PMA tests depend excessively on 
speed. Some of the tests, such as word fluency, should be speeded, but it 
is doubtful that speed will add to the predictiveness of other tests, such 
as reasoning and verbal comprehension.

The Differential Aptitude Tests. The Differential Aptitude Tests 
(DAT) were developed by The Psychological Corporation, principally 
for use in the vocational and educational guidance of high school stu-
dents (2). Although the tests were intended for use with students in
grades 8 through 12, they have a sufficient range of item difficulty to be
used with most adult groups. The tests were not developed directly out 
of factor-analytic work but were composed in such a way as to incorporate
some of the major findings from factor analysis.

The subtests of the DAT are as follows (see Figure 9-7 for sample 
items): 

1. Verbal Reasoning. The items consist of verbal analogies, in which the
reasoning component rather than the difficulty of words is emphasized.
The test is more concerned with the reasoning factors than with verbal
comprehension.

2. Numerical Ability. The test covers a wide range of numerical
computations. It should be a good measure of the numerical computation
factor as it was previously described.

3. Abstract Reasoning. ... in that all of the problems deal with abstract
patterns.

4. Spatial Relations. The test items concern the individual’s ability to
imagine how objects would look if they were rotated in space and to
visualize a three-dimensional object from a two-dimensional pattern.

5. Mechanical Reasoning. The test items consist of pictures which por- ...
picture.

6. Clerical Speed and Accuracy. The test is modeled closely after the
perceptual speed factor. The subject is required to find identical sets of
numbers and figures.

7. Language Usage. This test is more concerned with acquired knowledge,
or achievement, than with a specific aptitude. The two parts of the
test concern spelling ability and grammatical usage in sentences.

FIGURE 9-7. Sample items from the Differential Aptitude Tests. (Reproduced by
permission of The Psychological Corporation.)

Verbal Reasoning

Pick out words which will fill the blanks so that the sentences will be true and
sensible. For the first blank, pick out one of the numbered words. For the second
blank, pick out one of the lettered words.

..... is to one as second is to .....
1. middle   2. queen   3. rain   4. first
A. two   B. fire   C. object   D. hill

..... is to night as breakfast is to .....
1. flow   2. gentle   3. supper   4. door
A. include   B. morning   C. enjoy   D. corner

Abstract Reasoning

[Problem figures and answer figures omitted.]

Each row consists of four figures that are called Problem Figures and five called
Answer Figures. The four Problem Figures make a series. You are to find which one
of the Answer Figures would be the next, or fifth one, in the series.

Space Relations

[Pattern and answer figures omitted.]

The test consists of forty patterns which can be folded into figures. For each
pattern, five figures are shown. You are to decide which of these figures can be
made from the pattern shown.

Numerical Ability

Choose the correct answer.

Add 13 and 12:         A 14   B 25   C 16   D 59   E none of these
Subtract 20 from 30:   A 15   B 26   C 16   D 8    E none of these

Mechanical Reasoning

[Pictures omitted.]

Which man has the heavier load? (If equal, mark C.)
Which weighs more? (If equal, mark C.)

Language Usage Part I: Spelling

Indicate whether or not each word is spelled correctly.

man   gurl   catt   dog

Language Usage Part II: Sentences

The test consists of a series of sentences, each divided into five parts, lettered A, B,
C, D, and E. You are to look at each and decide which of the lettered parts have
errors in grammar, punctuation, or spelling.

Aint we / going to the / office / next week / at all.
   A           B           C          D          E

Clerical Speed and Accuracy

[Test items and sample answer sheet omitted.]

Each test item consists of five letter and number combinations. You are to look at
the one combination that is underlined and find the same combination on the answer
sheet.

Evaluation of the DAT. Although the DAT was not derived directly
from factor analysis, the tests represent a collection of reasonably
independent measures which range broadly over those factors most directly
related to school achievement. The tests were constructed primarily for
school programs, and it is doubtful that they would meet with the same
level of success in other testing situations. The correlations between the
tests vary considerably, going from .06 to as high as .67. The
intercorrelations run about .50 on the average. Although it is desirable to
have lower correlations, they are not so high as to prevent the battery from
functioning as a measure of differential aptitudes.

The reliabilities of the tests are generally high, ranging from .85 to .93 for 
all except the mechanical comprehension test. The reliability for men on 
the mechanical comprehension test is sufficiently high, with a mean coeffi- 
cient of .85, but the reliability for women is only .71. 
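How far a battery can function "as a measure of differential aptitudes" depends jointly on the reliabilities of the subtests and on their intercorrelation. One common way to express this is the standard formula for the reliability of a difference between two standardized scores, r_diff = ((r_aa + r_bb)/2 - r_ab) / (1 - r_ab). The sketch below is an illustration and is not taken from the DAT manual; the function name is hypothetical, and the inputs are rounded figures from the discussion above (reliabilities near .90, an average intercorrelation of .50, and a highest intercorrelation of .67).

```python
def difference_reliability(r_aa: float, r_bb: float, r_ab: float) -> float:
    """Reliability of the difference between two standardized test scores.

    r_aa, r_bb: reliabilities of the two tests.
    r_ab: correlation between the two tests.
    """
    if not -1.0 < r_ab < 1.0:
        raise ValueError("r_ab must lie strictly between -1 and 1")
    return ((r_aa + r_bb) / 2.0 - r_ab) / (1.0 - r_ab)


# Two subtests with reliability .90 that correlate .50 (the reported average).
typical = difference_reliability(0.90, 0.90, 0.50)

# The same reliabilities at the highest reported intercorrelation, .67.
most_correlated = difference_reliability(0.90, 0.90, 0.67)

print(f"average pair: {typical:.2f}")
print(f"most highly correlated pair: {most_correlated:.2f}")
```

Under these assumed inputs the difference score has a reliability of about .80 for an average pair of subtests and about .70 for the most highly correlated pair, which is consistent with the judgment that the intercorrelations are higher than desirable but not prohibitive.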


TABLE 9-6. MEDIAN CORRELATIONS OF DIFFERENTIAL APTITUDE TEST SCORES
WITH SCHOOL GRADES *

(Adapted from G. K. Bennett, et al., 2)

                                                             Soc. stud.,
Test                                English  Math.  Science    history    Languages  Typing  Shorthand
Verbal Reasoning (VR)                  50      39      54         50          30        19       44
Numerical Ability (NA)                 48      50      51         48          42        32       27
Abstract Reasoning (AR)                36      35      44         35          25        27       24
Space Relations (SR)                   27      32      36         26          15        16       16
Mechanical Reasoning (MR)              24      22      38         24          17        14       14
Clerical Speed and Accuracy (CSA)      24      19      26         26          23        26       14
Spelling (Spell.)                      44      29      36         36          31        26       55
Sentences (Sent.)                      52      36      48         46          40        30       49

* Decimal points omitted.

A particular point in favor of the DAT is the large amount of research
that went into the standardization and validation of the instrument. The
norms are based on the testing of 47,000 students in grades 8 to 12
scattered widely over the United States. Because sizable sex differences were
found on some of the tests, separate norms are given for boys and girls.

Thousands of correlations between DAT scores and various criteria
have been computed. Some correlations between test scores and class
grades are shown in Table 9-6. A follow-up study was undertaken of
students two years after completing high school. The results are shown in
Table 9-7. In Table 9-7, scores for the total group are broken down in
terms of subsequent occupation or college specialty. Some of the groups
in Table 9-7 are represented by only several dozen individuals, and
consequently, the results should be considered tentative. However, there is
an apparent tendency for occupations to be represented not only by
differences in performance level but by “shape” characteristics as well. More
research of this kind will provide the DAT and other similar test batteries
with a firmer basis for vocational counseling. The DAT is a model of
careful test design, practicality, thoroughness of research, and frankness
of reporting.

TABLE 9-7. PERCENTILE EQUIVALENTS OF AVERAGE SCORES ON THE DAT
FOR MEN IN VARIOUS EDUCATIONAL AND OCCUPATIONAL GROUPS

(Adapted from G. K. Bennett, et al., 2)

                                                               Percentiles
Group                                             No.  VR  NA  AR  SR  MR  CSA  Spell.  Sent.
Degree-seeking students:
  Premedical                                       24  88  86  81  72  77   ?     90     90
  Science (biology, chemistry, and math.)          25  81   ?  60  67  68   ?      ?     72
  Engineering (includes architectural)             70  80  86  80  81  82   ?     68     74
  Liberal arts (includes prelaw)                   68  79  75  78  61  64   ?     78      ?
  Business administration                          64  72  73  61  63  60   ?     67     68
  Education (includes physical education)          25  68  66  58  57  48  68     67      ?
  Various: predental, agricultural, etc.           30  64  67  74  60  73  59     55     64
Non-degree students in two-year schools:
  Business, technical, fine arts, etc.             43  63  62  71  72  68  60     54     61
Employed:
  Salesmen                                         23  56  53  53  52  57  44     50     49
  Clerks: general office work                      55  45  42  52  44  41  41     52     53
  Mechanical, electrical, and building trades      66  34  41  46  49  56   ?      ?      ?
  Various skilled: butcher, baker, etc.            26  47  87  43  41  49  51     52      ?
  Various unskilled: truck driver, laborer, etc.    8  35  30  36  42  49  37     35      ?
Military service                                  129  46  42  46  51  50  49     46     41
Unclassified:
  No consistent work or school record              58  53  48  51  54  54  46     47     52

? = entry illegible.

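The distinction between performance level and profile “shape” can be illustrated with two rows of Table 9-7. The sketch below is only one simple way to separate the two ideas: it treats the mean of a group's eight percentiles as its level and the deviations from that mean as its shape. Averaging percentiles is a simplification adopted for illustration, and the function name is hypothetical.

```python
from statistics import mean

SUBTESTS = ["VR", "NA", "AR", "SR", "MR", "CSA", "Spell.", "Sent."]

# Percentile profiles copied from two rows of Table 9-7.
PROFILES = {
    "Salesmen": [56, 53, 53, 52, 57, 44, 50, 49],
    "Two-year non-degree students": [63, 62, 71, 72, 68, 60, 54, 61],
}


def level_and_shape(percentiles):
    """Split a profile into its level (mean) and shape (deviations from the mean)."""
    level = mean(percentiles)
    shape = [p - level for p in percentiles]
    return level, shape


for group, scores in PROFILES.items():
    level, shape = level_and_shape(scores)
    pattern = ", ".join(f"{name} {dev:+.1f}" for name, dev in zip(SUBTESTS, shape))
    print(f"{group}: level {level:.1f}; shape {pattern}")
```

Two groups with the same level can still differ in shape, and it is the shape that carries the differential information a multifactor battery is meant to supply.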
Other Differential Aptitude Batteries. A number of other batteries are 
available. Some of these will be listed and briefly described. 

General Aptitude Test Battery (GATB). This is a very broad battery of 
15 tests constructed by the U.S. Employment Service. More attention 
would be given to the battery here, but unfortunately the tests are avail- 
able only to state employment services. The tests were developed directly 
from factor-analytic studies, and, in most cases, scores from more than one 
test are used for each factor. The following attributes are measured: 

(G) called “general intelligence,” (V) verbal aptitude, (N) numerical 
aptitude, (S) spatial aptitude, (P) form perception, (Q) clerical percep- 
tion, (A) aiming, (T) motor, (F) finger dexterity, and (M) manual 
dexterity. 

Flanagan Aptitude Classification Tests (FACT). The battery consists 
of 14 relatively independent tests ranged broadly across the ability func- 
tions. The battery is published by Science Research Associates. Weight- 
ings are given for the prediction of 30 occupational groupings. Almost no 
evidence is presented to support the claims of validity. 

Guilford-Zimmerman Aptitude Survey. The tests in this series are an 
important effort to represent a large share of the known factors of human 
abilities. The tests are constructed on the basis of factor analyses under- 
taken in the Air Force. Each test is intended to have high factorial validity 
for one of the known factors. The present battery includes tests for the 
following factors: verbal comprehension, general reasoning, numerical 
computation, perceptual speed, spatial orientation, spatial visualization, 
and mechanical knowledge (not an aptitude factor). Tests are being 
prepared to represent more factors. 

An insufficient amount of research has been done with the battery to 
know how well it will work. The evidence so far suggests, as would be 
expected, that using only one test does not provide an adequate measure 
of each factor. The battery is still in an experimental stage.

The Status of Multifactor Batteries. Whereas there has been a
considerable amount of research done to find new factors of human ability,
relatively little research has been done to develop test batteries for
practical use. This is due in part to the tendency for measurement specialists
to be more interested in the theoretical aspects of the problem than in the 
more mundane practical aspects. Another reason is that test batteries can 
often be sold for considerable profit, and it is tempting to place tests on 
the market before their actual worth has been determined.

CHAPTER 10


General Intelligence Tests 


The text discussion is far enough along to permit a more direct explana- 
tion of what “general intelligence” tests measure. In Chapter 9 it was 
shown that the domain of intellectual abilities is not entirely “general,” 
that a number of separable factors can be shown to underlie tests of 
human ability. The tests which will be discussed in this chapter are gen- 
eral instruments in that the individual receives only one score. The one 
score is an over-all, or general, measure of his ability to perform the 
different kinds of material in the test. 

What the General Intelligence Tests Measure. Having gone to some 
lengths to show that the concept of “general intelligence” is somewhat 
misleading, at this point let us consider the arguments supporting the 
practical use of general measures. The general intelligence tests take 
advantage of the fact that the separable factors of intellect are themselves 
correlated. In addition, not all of the factors are intimately related to 
intelligence as it is popularly conceived. Factors such as verbal compre- 
hension and general reasoning are more commonly thought of as consti- 
tuting intelligent behavior than are factors like spatial orientation and per- 
ceptual speed. Therefore, the general intelligence tests have tended to 
incorporate material largely from the factors which more nearly fit the 
popular conception of intelligence. The use of one score represents pri- 
marily the adding together of only several factors rather than the many 
that can be found in diverse tests. 

The content of most general intelligence tests is predominantly related 
to the verbal comprehension factor, the numerical computation factor, and 
various mixtures of the reasoning factors. The remaining material in 
these tests tends to be scattered thinly over the memory, spatial, and 
perceptual factors. Most of the intelligence tests correlate highly with 
one another, coefficients as high as .80 being commonly found between 
some of them. However, the correlations are in most cases far enough 
from perfect to show that the tests measure partially different things, or 
factors. Factor-analytic studies have been made of some of the major 
tests, and these findings will be discussed in the pages ahead.

Some psychologists argue for the use of general measures because they
have not given up the concept of general intelligence. They remain
suspicious of factor-analytic findings and maintain a “holistic” view of
intelligence. Their viewpoint is that the individual’s “effective intelligence”
comes from a pattern or combination of underlying factors that is
difficult to predict from scores on the separate factors. There is some support
for this viewpoint, particularly in relation to school work. It is possible
for individuals to solve the same problem successfully yet use different
approaches. For example, one individual may rely heavily on his spatial
ability to help solve geometrical problems. Another person may solve the
same problems more abstractly by relying on his reasoning ability.

Another argument for the holistic conception of intelligence is that the
grades in different school courses tend to correlate highly with one
another. This is perhaps partially the fault of teachers, who tend to form
over-all impressions of students; but it also indicates that to an extent
the underlying factors are somewhat interchangeable in real-life
situations. Regardless of the theoretical merit of the holistic viewpoint, the
kinds of tests which have traditionally been used as measures of general
intelligence are composed of a number of different factors.

Another argument for the use of general intelligence tests arises from 
the defects in many multifactor, or differential aptitude, batteries. It was 
previously said that an efficient differential aptitude battery must have 
highly reliable subtests which correlate low or only moderately with one 
another. Otherwise, the battery will manifest only statistically insignifi- 
cant differential scores, and it would be as well to add the tests together 
to form one dependable measure. This is a telling argument against some 
of the less carefully constructed multifactor batteries. However, as more
research is done on the construction of multifactor batteries, the general
intelligence tests will probably assume less importance as measurement
devices.


INDIVIDUAL VERSUS GROUP TESTS 


Tests can be given to each individual separately, or a group of
individuals can be tested at one time. Although this is a consideration for
tests of all kinds, it is a particularly important issue in the use of general 
intelligence tests. The multifactor batteries are almost always group tests. 
There is considerable competition between individual and group tests of 
general intelligence. Numerous articles have appeared in the psycho- 
logical literature arguing whether group or individual tests should be used 
with particular age groups. 

Individual Tests. The first practical general intelligence test, the
Binet-Simon scale (9), was administered individually. This was necessary
because the subjects were young children. The individual test requires a
highly experienced examiner. The examiner is in essence a part of the
standardized testing procedure, and he must standardize his own
treatment of the child to conform to established methods. It is not necessarily
the examiner's job to get the “best” score from the subject but to obtain a
score which is the same as would be found by other expert examiners. If
one examiner is more or less friendly and encouraging than another, this
will inevitably lower the test reliability.

Many of the items on individual tests cannot be scored unambiguously
as right or wrong. Instead there may be a number of acceptable responses,
and different scores are often required to indicate the degree of
correctness. The better-established individual tests go to considerable effort to
specify just what will be considered a correct response and how much 
credit should be given for a response. The examiner must follow the estab- 
lished scoring procedures meticulously and not permit his subjective 
judgment of a child to influence the test results. Any idiosyncrasies in 
scoring will make the test less reliable. 

Group Tests. As was discussed previously, the first practical group tests 
of general intelligence were developed for the Armed Forces during the 
First World War. Here again necessity was the mother of invention: the 
number of men to be tested required a quick and economical measure- 
ment device, and the individual test was unsuited for that purpose. 

Most group tests are sufficiently self-explanatory so that the test exam- 
iner need have little or no specialized knowledge of testing procedures. 
The test forms are simply passed out to subjects; either they are allowed 
to work at their chosen rate or the examiner directs the subjects when 
to start and stop. 

Comparison of Individual and Group Tests. Although the following 
rules are not precisely correct for all group and individual tests, the rules 
are sufficiently general to offer a reasonable basis for choosing between 
the two kinds of tests in most situations. 

1. Individual tests are required with young children. Starting at the 
earliest age, there are no group tests that can be used with infants. Each 
infant must be examined separately, and any effort at standardization of 
procedures is difficult. With preschool children and children in the early 
grades of grammar school, it is usually necessary to use individual tests. 
Young children either cannot read at all or they lack the reading ability 
to take the self-explanatory group forms. As an additional factor, young 
children are highly distractable, and it is all that the expert examiner can 
do to keep one child working at the test materials. Young children are 
often not motivated to do well on tests, and it is only through the
examiner’s careful, but standardized, encouragement that a meaningful
measure can be obtained.

2. Group tests are generally more desirable in testing “normal” adults.
The well-standardized group tests prove to be as good predictors as the
individual tests when working with most teen-age and adult groups.
Although the evidence is clear that individual tests are better predictors 
for young children and that group tests do as well with adolescents or 
older, it is not certain which kind of test is generally more valid with the 
in-between age group. 

Adolescents and older persons are usually motivated and attentive 
enough to manifest a meaningful score on a group test. Also, they are 
less embarrassed by the group testing situation than they would be in the 
face-to-face individual testing situation. 

Because group tests tend to be as valid as individual tests when 
administered to adults, group tests are much to be preferred in practical 
work. Group tests are much less expensive and time consuming. It takes 
no more time to administer and score 100 group tests than it does to 
administer and score 1 individual test. 

3. Individual tests are often useful in clinical settings. The clinical psy- 
chologist can often learn considerably more from the individual test than 
the subject’s score would indicate. The child who appears dull in the 
classroom may be only hard of hearing. Another child may do poorly in 
school work because he wants to do poorly, giving wrong answers when 
he knows what is correct. The older adult may appear to be “demented” 
because he is discouraged and withdrawn. These things probably would 
not be found in group testing situations, but an experienced clinician can 
often use the individual testing situation to diagnose why an individual 
is performing poorly.

4. Group tests are usually easier to construct than individual tests. For 
the person who plans to construct a predictor test, an individual test 
should not be undertaken unless measurement specialists are available 
and it is planned to spend considerable time and money in test construc- 
tion. The difficulties in constructing test materials, standardizing the test, 
and particularly the setting of instructions for administration and scoring, 
make the development of an individual test a time-consuming, expensive
job. Training test examiners for the individual tests creates additional
expense.


VERBAL VERSUS PERFORMANCE TESTS 


There has been some confusion as to the difference between “verbal”
tests and “performance” tests, and the distinction is itself somewhat
misleading. The following outline of verbal components in tests is offered as
a basis for discussing test content:

Verbal requirements

1. Understand spoken language
2. Understand written language
3. Speak language
4. Write language
5. Verbal comprehension factor

There are many different combinations of the above verbal require- 
ments in particular tests. A test may require the first four items in the list, 
but deal with language at so simple a level that very little ability in verbal 
comprehension is required to obtain a high score. It is possible to make 
up a test in which almost none of the five aspects is required. This can be 
done by giving the test instructions in pantomime and using test materials 
that require neither written nor spoken responses. It is also possible to 
compose a test in which none of the first four aspects is present and the 
fifth aspect, verbal comprehension, is a cardinal requirement. If the test 
requires the child to manipulate abstract symbols or to deal with pictures, 
this will tap, in part, the verbal comprehension factor. Each test should 
be examined in terms of its combination of verbal requirements rather 
than simply classified as “verbal” or “nonverbal.” 

Another distinction can be made between tests in terms of the way in 
which responses are made: 

Nature of response 

1. Symbolic response. The subject indicates the correct answer either 
through the use of language or by marking one of a number of 
choices. The symbolic response might be made in respect to 
objects rather than printed materials, although this is usually not 
done. 

2. Manipulative response (performance). The subject is required to 
handle objects in such a way as to complete a specified product. 
The product may be anything from a completely finished piece of 
machinery to the arrangement of a set of blocks. 

Some items are not clearly differentiated in terms of the two kinds of 
responses. For example, in maze tracing, the child is required to coordi- 
nate the pencil and move to the goal—the response is as symbolic as it is 
manipulative. 

It has been the custom to call instruments “performance” tests if they 
deemphasize language requirements, employ three-dimensional materials, 
and require manipulative responses. Because of these components, the 
“performance” tests usually measure motor coordination, speed, and per- 
ceptual and spatial factors. Instruments are usually referred to as verbal 
tests if they are printed forms, emphasize verbal comprehension, and 
require symbolic responses. Because of the ease with which certain kinds 
of test materials can be placed on printed forms, the “verbal” tests tend 
to measure the verbal comprehension, numerical computation, and the 
reasoning factors. There is no clear-cut separation between the factors
found in “verbal” and “performance” tests, but there is a tendency for
different factors to arise in the two kinds of materials.


THE BINET TEST AND ITS FOLLOWERS 


The 1905 Binet-Simon test (see 33) consisted of the following 30 items: 


1. Follow a lighted match with head and eyes.
2. Grasp a cube placed on the palm.
3. Grasp a cube held in line of vision.
4. Make a choice between pieces of wood and chocolate.
5. Unwrap chocolate from paper.
6. Execute simple orders.
7. Touch head, nose, ear, cap, key, and string.
8. Point to objects which experimenter names in picture.
9. Name objects pointed out in a picture.
10. Judge which of two lines is the longer.
11. Repeat immediately three digits read by examiner.
12. Judge which of two weights is heavier.
13. Solve problems that embody novel, ambiguous, or contradictory
    solutions.
14. Define house, horse, fork, and mamma.
15. Repeat sentence of 15 words after a single hearing.
16. Describe differences between: paper and cardboard; fly and butterfly;
    wood and glass.
17. Memorize 13 pictures of familiar objects.
18. Memorize immediately two designs.
19. Memorize immediately lists of digits.
20. State the similarities between: blood and wild poppies; fly, butterfly,
    and flea; newspaper, label, and picture.
21. Judge differences in length of lines.
22. Rank order weights of 3, 6, 9, 12, and 15 grams.
23. Determine which weight the examiner removes from No. 22.
24. Make rhymes.
25. Complete sentence from which one word has been deleted.
26. Construct sentence which contains the words Paris, gutter, and
    fortune.
27. Describe the best action to take in 25 real-life situations.
28. Tell what time it would be if the hour hand and minute hand were
    interchanged at 3:57 and at 5:40.
29. Make a drawing of how the holes in a folded piece of paper would
    look if the paper were unfolded.
30. Describe what is different between liking and respecting, and
    between being sad and being bored.


The list of questions was tried on about fifty children and provisional
norms were established (see 33). It was found that the first five items
could be passed by idiots and normal two-year-olds. Most three-year-olds
could go no further than about the ninth item. Most five-year-olds could
go no further than about the fourteenth item.

After considerable use of the 1905 scale, a revision was published in
1908 (see 33). Items in the earlier scale which did not seem to work well
were eliminated and new ones were added. The outstanding feature of
the 1908 revision was the grouping of items by age. A group of items was
chosen to represent each age level so that about half the children at the
age level could answer the items correctly. The level at which the child
missed not more than one of the items was called his basal age. To the
basal age was added one year for each five items he answered correctly
at higher age levels. The basal age plus the years’ credit for items
answered at higher levels constituted the child’s mental age. The mental age
arrived at in this way was but a crude approximation of the child’s
intellectual standing. The number of tests at different age levels varied from
three to eight, and this was not taken into account when figuring the
mental age. No credit was given for fractional “years” on the scale, the
mental ages being restricted to whole numbers. As a more general
criticism, the selection of items and the determination of age levels were based
on a rather small amount of standardization research.
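The 1908 scoring rule, and the reason it yields only a crude approximation, can be sketched in a few lines. The function below is an illustration of the rule as it is described here (one whole year for every five items passed above the basal level, no fractional credit); the function name and the example child are hypothetical.

```python
def mental_age_1908(basal_age: int, items_passed_above_basal: int) -> int:
    """Mental age under the 1908 rule as described in the text.

    One whole year is credited for every five items passed above the basal
    level; leftover items earn nothing, so the result is a whole number.
    """
    if basal_age < 0 or items_passed_above_basal < 0:
        raise ValueError("inputs must be non-negative")
    return basal_age + items_passed_above_basal // 5


# Hypothetical child: basal age 6, with 7 items passed at higher levels.
print(mental_age_1908(6, 7))  # 7

# Passing two more items (9 in all) leaves the mental age unchanged.
print(mental_age_1908(6, 9))  # 7
```

Because the integer division discards any remainder, children who differ by several items passed can receive the same mental age, which is one of the weaknesses noted above.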

American Adaptations of the Binet-Simon Scales. During the first dec- 
ade of this century the Binet-Simon scales became very popular in 
America. Instruments of this kind were needed in the care of the feeble- 
minded, in educational research, and in the understanding of juvenile 
delinquency. Goddard (19) made an early adaptation of the scales, prin- 
cipally translating them into English and making minor revisions. Later 
Kuhlmann (26) revised the scales and made an effort to extend the test 
downward to the age of three months. 

The Terman Revision of the Binet-Simon Scales. The first extensive 
revision of the Binet-Simon scales was undertaken by Terman and his 
colleagues at Stanford University. The new scale, published in 1916, was 
referred to as the Revised Stanford-Binet (46). The revision was so exten- 
sive as to constitute essentially a new test. Over one-third of the items in 
the revision were new and a number of items remaining from the original 
form were altered and moved to different age levels. 

For the first time, considerable empirical work went into the standardi- 
zation of the Binet type scales. The test was standardized on approxi- 
mately 1,000 children and 400 adults. An effort was made to select a 
sample of persons that would approximately represent the population of 
the United States. Although the standardization sample left much to be 
desired by modern standards, the effort to obtain a representative group
of people set a milestone in psychological measurement.

In the 1916 revision, careful attention was given to the writing of
instructions for test administration and scoring. The IQ type of scoring
was employed for the first time in the Binet series of tests. Because of the
careful work that went into the revision, the test became the standard
clinical instrument for measuring intelligence. It was probably used more
frequently than any other test during the twenty years after it was
published.
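The IQ type of scoring is conventionally a ratio: mental age divided by chronological age, multiplied by 100. The sketch below assumes that definition together with the crediting scheme of Table 10-1 (six items at each year level, hence two months' credit per item) and uses that table's figures for a child of seven years and six months. The function names are hypothetical.

```python
def mental_age_months(basal_age_years, items_passed_above_basal, months_per_item=2):
    """Mental age in months: basal age plus credit for items passed above it.

    items_passed_above_basal lists the number of items passed at each
    successive year level above the basal level.
    """
    credit = sum(items_passed_above_basal) * months_per_item
    return basal_age_years * 12 + credit


def ratio_iq(mental_months, chronological_months):
    """Ratio IQ: 100 times mental age over chronological age."""
    return 100.0 * mental_months / chronological_months


# Table 10-1: basal age VIII; 4 items passed at IX, 2 at X, 0 at XI.
ma = mental_age_months(8, [4, 2, 0])
iq = ratio_iq(ma, 7 * 12 + 6)

print(f"mental age: {ma // 12} years {ma % 12} months")
print(f"IQ: {iq:.0f}")
```

With these figures the mental age comes to 9 years, as in the total row of Table 10-1, and the ratio against a chronological age of 7½ years gives an IQ of 120.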

The 1937 Revision. The second revision of the Stanford-Binet (47) was 
a ten-year undertaking. Instead of developing only one form, two equiva- 
lent forms, Form L and Form M, were produced. They are so different 
from the 30 items with which Binet began the series that it is only out of
courtesy to Binet that his name is linked to the 1937 tests.

Each of the new tests was longer than the 1916 test; different kinds of 
test items were used, and an even more strenuous effort at standardization 
was made. A large number of items was tried out on groups of children 
and adults. In all, about 1,500 persons were used in the empirical tryout 
of items. The principal criterion for selecting items was that they each 
have a “steep age gradient.” That is, an item is desirable for the six-year 
level if few of the five-year-olds get the item correct, most of the seven- 
year-olds get it correct, and about half of the six-year-olds get the item 
correct. In order to find sets of items for the different age levels, a consid- 
erable amount of testing and statistical analysis was undertaken. 
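The “steep age gradient” criterion can be sketched as a simple screen on the proportion of children passing an item at adjacent ages. The cutoffs used below for “few,” “about half,” and “most” are assumptions chosen for illustration, not values reported for the 1937 revision, and the pass rates are invented example data.

```python
def has_steep_age_gradient(pass_rates, target_age,
                           few=0.25, most=0.75, half_tolerance=0.15):
    """Check whether an item suits a given age level.

    pass_rates maps age in years to the proportion of children passing.
    The item qualifies if few younger children pass it, most older children
    pass it, and about half of the children at the target age pass it.
    """
    younger = pass_rates[target_age - 1]
    at_age = pass_rates[target_age]
    older = pass_rates[target_age + 1]
    return (younger <= few
            and older >= most
            and abs(at_age - 0.5) <= half_tolerance)


# Invented pass rates for two candidate six-year items.
steep_item = {5: 0.15, 6: 0.52, 7: 0.85}
flat_item = {5: 0.40, 6: 0.50, 7: 0.60}

print(has_steep_age_gradient(steep_item, 6))  # True
print(has_steep_age_gradient(flat_item, 6))   # False
```

Both invented items are passed by about half of the six-year-olds, but only the first separates five-year-olds from seven-year-olds sharply enough to be useful at the six-year level.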

After the selection of items and the provisional construction of Form M 
and Form L, the forms were administered to a carefully selected sample 
of the United States population. The sample included 3,184 white,
native-born persons from the ages of 1½ to 18 years. An equal number of boys
and girls was tested at each age level. The sample was drawn from 17
communities in 11 widely separated states. An effort was made to
distribute the subjects in each age level equitably across the socioeconomic
levels. Although the sample was not entirely representative of the United 
States population, there has never been a more strenuous effort to create 
a well-standardized test, 

The results obtained from the large sampling of responses to Form L 
and Form M were again item analyzed. The items in each form were 
rearranged in such a manner that the mean IQ at each age level is very 
close to 100 and the mean IQ for boys and girls is nearly identical. An 
effort was also made to obtain equal standard deviations for the IQs at 
each age level. However, the researchers were not entirely successful in 
equating standard deviations. 

The Stanford-Binet test is still used as widely as, or more widely than, any 
other clinical instrument. There are hundreds of articles in the psychological 
literature on the use of the test, on developments in testing procedure, 
and on reliability and validity. 

Administration of the Stanford-Binet. The test materials (see Figure 
10-1) include a box of performance items (beads, toys, pictures, etc.), 
test blanks on which the child’s responses are recorded, and a manual of 
instructions (47). As was mentioned previously, the administration of an 
individual test of this kind requires a highly trained examiner, and no 
faith should be placed in the scores obtained by an amateur tester. The 
test can usually be given in a period of fifty to seventy-five minutes. 

Figure 10-1. Some of the test materials used with the Stanford-Binet. (Courtesy of 
Houghton Mifflin Company.) 

No individual is administered all of the items. Instead, the individual is 
started slightly lower in the age scale than he is expected to reach. For 
example, a typical procedure would be to start a seven-year-old off on 
the five-year-level questions. The child is then taken up through the age 
levels as far as he can go. The highest level at which he gets all of the 
items correct is called the basal age. The level at which the subject gets 
none of the items correct is called the maximal age. The subject's mental 
age equals his basal age plus specific extra credits for items correct above 
the basal age. The IQ is then obtained by dividing mental age by chronological 
age and multiplying the result by 100 (see Table 10-1). 


TABLE 10-1. COMPUTATION OF IQ FOR A CHILD OF SEVEN YEARS AND SIX 
MONTHS (six items at each of the age levels) 

Test-year level   Number of items passed   Months’ credit per item   Years’ credit 

VII               6 
VIII              6 (basal age)                                       8 
IX                4                         2                         2/3 
X                 2                         2                         1/3 
XI                0 (maximal age) 

Total: 9 years 

Mental age = 9.0    Chronological age = 7.5    IQ = (9.0 / 7.5) × 100 = 120 
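
The computation in Table 10-1 can be followed step by step in a short Python sketch. The function names are illustrative only, and the credit of two months per item is taken from the table (six items at each level); nothing here is meant to reproduce the scoring rules of the test manual.

```python
from fractions import Fraction


def mental_age_years(basal_age, passed_above_basal, months_per_item=2):
    """Basal age plus months of credit for items passed above the basal level.

    passed_above_basal lists the number of items passed at each successive
    level above the basal age, up to the maximal age.
    """
    extra_months = sum(passed_above_basal) * months_per_item
    return basal_age + Fraction(extra_months, 12)


def ratio_iq(mental_age, chronological_age):
    """IQ as mental age divided by chronological age, times 100."""
    return 100 * Fraction(mental_age) / Fraction(chronological_age)


# The child in Table 10-1: basal age VIII, then 4 items at IX, 2 at X, 0 at XI.
ma = mental_age_years(8, [4, 2, 0])
ca = Fraction(15, 2)  # seven years and six months

print(float(ma))                # 9.0
print(float(ratio_iq(ma, ca)))  # 120.0
```

Exact fractions are used so that credits such as two-thirds of a year are not rounded before the final division.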

The Content of the Stanford-Binet. The present test still bears many of 
the earmarks of the original Binet emphasis on cultural artifacts: words, 
objects, and knowledge of the cultural environment. A knowledge of 
words and their usage plays a predominant part in the scale, particularly 
at the upper levels. Although a number of items of the performance type 
are used at the earlier ages (see Figure 10-1), these are also primarily 
concerned with the child’s recognition and use of common everyday 
objects. Memory items are found throughout the scale, particularly in the 
immediate recall of digit series. A few items concerning perceptual and 
spatial abilities are found at different age levels. Many of the items concern 
one or more kinds of reasoning ability. To illustrate the nature of the 
scale, the items at four different age levels are described as follows: 

Two-year level 

1. Three-hole form board. The child is shown a form board contain- 
ing a cut-out square, circle, and triangle. The pieces are removed 
and the child is asked to put them in their places. The child re- 
ceives credit if all three objects are put in place. 

2. Identifying objects by name. The child is shown six toy objects 
(button, spoon, etc.) and is asked to identify them. Credit is 
received for correctly naming four of the six objects. 

3. Identifying parts of body. The child is shown a large paper 
doll and is asked to point to parts of the body (“Show me the 
dolly’s hair,” etc.). Credit is received for correctly pointing out 
three of the parts. 

4. Block building. A box of blocks is placed before the child. The 
examiner builds a tower of four blocks and asks the child to do the 
same. Credit is received if the child makes a tower of at least four 
blocks. 

5. Picture vocabulary. The child is shown 18 cards containing pictures 
of common objects. With each picture he is asked, “What is 
this?” Credit is given if the child names at least two of the objects. 

6. Word combinations. The examiner notes the spontaneous speech 
of the child during the test. Credit is given if the child uses at 
least a two-word combination such as “see kitty.” 

Six-year level 

1. Vocabulary. The child is asked the meaning of words from a 
graded list of 45 terms. Credit is given for five correct definitions. 
(The same list of words is used throughout all higher age levels.) 

2. Copying a bead chain from memory. The examiner makes a chain 
of seven beads using alternately a square and then a round bead. 
The child is asked to make a bead chain like the examiner’s. Credit 
is given if all beads are placed correctly. 

3. Mutilated pictures. The child is shown five pictures in which an 
object has a missing part, e.g., a wagon with only three wheels, 
and asked to say what is missing in each. Credit is given for 
getting as many as four of the problems correct. 

4. Number concepts. Twelve blocks are put in front of the child. He 
is asked to give different numbers of the blocks to the examiner. 
Credit is given for selecting three correct numbers of blocks out 
of four trials. 

5. Pictorial likenesses and differences. The child is shown six cards 
with a number of figures on each. On each card one figure is 
different in some way from the others. The child is asked to point 
to the figure which is different. Credit is given for five correct 
responses out of six. 

6. Maze tracing. The child is shown three designs each of which 
shows two ways for a person to get home. One route is longer 
than the other. The child is asked to trace the shorter route. Credit 
is given for two correct responses out of three. 

Ten-year level 

1. Vocabulary. The child is asked to define words from the standard 
list. Credit is given for 11 or more correct definitions. 

2. Picture absurdities, II. A picture is shown in which an individual 
is acting in an illogical manner. The child is asked, “What is 
foolish about that picture?” Credit is given if the child can explain 
what is illogical. 

3. Reading and report. The child is asked to read a selection of 
material. He is then asked to recall as much as possible of what 
he read. Credit is obtained if there are not more than two errors 
in reading and at least ten events are recalled. 

4. Finding reasons, I. The child is asked to explain why two social 
rules are necessary, e.g., “Give two reasons why children should 
not be too noisy in school.” Credit is given for supplying two reasons. 

5. Word naming. The child is asked to name as many words as he 
can in two minutes. Credit is given for 28 words or more. 

6. Repeating six digits. Six digits such as 4, 8, 2, 1, 6, 3 are read at 
one-second intervals. The child is asked to repeat the digits in 
their exact order. Credit is given if one or more complete series 
out of three are recalled correctly. 


Average adult level 


1. Vocabulary. The subject is asked for definitions of words in the 
standard list. Credit is given if 20 or more are defined. 

2. Codes. The subject is shown a short sentence and a coded version 
of the sentence. The code consists of interchanging certain letters 
of the alphabet. The subject is asked to figure out the code and 
then write the word “hurry” in the code. Next, another similar 
item is given. Credit is allowed if 1½ of the two codes are mastered. 

3. Differences between abstract words. The subject is asked to distinguish 
between three pairs of associated words, e.g., poverty and 
misery. Credit is given if two or more of the distinctions are 
correct. 

4. Arithmetical reasoning. The subject is asked to solve three problems; 
e.g., “If two pencils cost 5 cents, how many pencils can you 
buy for 50 cents?” Credit is given for two or more correct solutions. 

5. Proverbs, I. The subject is asked to explain the meaning of three 
proverbs, e.g., “A burnt child dreads the fire.” Credit is given for 
two or more correct interpretations. 

6. Ingenuity. Three problems are given. An example is: a boy is sent 
to the river to get exactly three pints of water, and he has only 
a 7-pint container and a 4-pint container. How can he measure 
the water? Credit is given if two problems are solved. 

7. Memory for sentences, V. The subject is asked to recall a sentence 
after one hearing. Credit is given if at least one of two sentences is 
exactly recalled. 

8. Reconciliation of opposites. The subject is asked in what way 
pairs of opposing terms are alike, e.g., sick and well. Credit is 
given for three correct answers out of six. 


ANALYSIS OF THE STANFORD-BINET 


With each of the major tests and measurement methods which are 
described in the remainder of the book, a critical analysis will be given. 
The comments which are made should not be relied upon solely as the 
basis for selecting tests for specific purposes. When tests are being obtained, 
the relevant research literature should be studied (see Chapter ...). 
The critical analysis will provide an opportunity to utilize the 
principles and methodology of measurement which were presented in 
the first half of the book. In criticizing particular measurement methods, a 
careful distinction should be made between how well an instrument 
measures up to ideals and how well it works in comparison with other 
procedures. In many places we will see that a measure which is crude 
in terms of ideal methods turns out to be much better than the even 
cruder procedures which might otherwise be used. 


What Does the Stanford-Binet Measure? One obvious feature of the 
Stanford-Binet is that it is not a test in the sense in which the term is 
customarily used. Because different persons take different items in terms 
of their age, scores are not directly comparable from age to age. Some 
types of items are used over a wide age range, particularly the memory 
for digits and vocabulary. Other types of items appear in only one or 
several age levels, such as maze tracing, form board, and the interpretation 
of proverbs. 

One of the primary criteria for the selection of items was the correlation 
of items with chronological age.¹ This seems to be a reasonable procedure 
in that whatever “intelligence” connotes to people it is thought to 
grow as the child grows. It is, however, by no means safe to assume that 
anything that increases with age is necessarily an index of intelligence. 
Children of twelve years not only have larger vocabularies than eleven- 
year-olds, their feet are longer also. No one would consider using the 
length of a person’s foot as a measure of intelligence. The constructors of 
the scale used not only “age differentiation” to select items, they used 
their intuitions as well. As Binet had done, they selected items which 
looked as though they related to intelligence, as it is popularly conceived, 
and then weeded out items which failed to show the required statistical 
characteristics. 

¹ Equally important was the correlation of each item with total score on the test. 

The Factor Structure of the Stanford-Binet. Looking more precisely at 
what the Stanford-Binet measures, different investigators have found the 
test to embody at least several factors. This is so in spite of the effort to 
select items which correlate highly with one another. McNemar (29) reported 
the results of 14 factor analyses of the Stanford-Binet items. He 
found that a first factor at each age level accounts for from 35 to 50 per 
cent of the item variance. It was his interpretation that the first factor 
is much the same at different age levels. A second factor was found to 
explain from 5 to 11 per cent of the item variance, and a third factor was 
found to account for from 4 to 7 per cent of the item variance. Taking 
these results at face value, they seem to give supporting evidence for the 
existence of one major factor in the Stanford-Binet. Several points, however, 
need to be examined before this conclusion is reached. If only one 
factor were found in the items, this would not mean that “intelligence” is 
general but only that the items in the Stanford-Binet had been chosen in 
such a manner as to represent only the common ground of human abilities 
(McNemar had not tried to show otherwise by his analyses). 
McNemar performed his analyses in such a way as to make the most of 
the common ground in the items. He showed that a general factor can 
account for much of the item variance. If one tries, however, to search 
for factors in the Stanford-Binet rather than minimize them, it can be 
shown that a number of factors are present. A thorough factor analysis 
of the items would need to include an extensive array of the kinds of tests 
which have led to the discovery of known factors. Also, McNemar’s conclusion 
that the first factor obtained from the analyses is the same for 
different age levels is open to question. 
Jones (24) offers a different interpretation of the factor structure of 
Stanford-Binet items. He performed separate factor analyses of the items 
for children at the ages of 7, 9, 11, and 13 years. Each analysis employed 
100 boys and 100 girls. Jones made an effort to find the complete factor 
structure of the items. Like McNemar, he used only the Stanford-Binet 
items in the analyses, without the inclusion of reference tests of known 
factor composition. He found a number of separable but correlated 
factors at each age level. His results are summarized in Figure 10-2. 
Factor analyses of items usually give less clear-cut results than factor 
analyses of whole tests. Jones's factors, therefore, are not clearly interpretable. 
Aside from the difficulty of interpreting all of the factors clearly, 
his analyses strongly indicate that the factor structure is not the same at 
all age levels. For example, although what appear to be reasoning factors 
occur at ages 7, 9, and 13, nothing of the kind was found in the 11-year 
age group. The verbal factors which predominate at the four age levels 
are less important at some ages than others. 

Figure 10-2. Names and relative influence of factors reported at several age levels. 

The answer then to whether the Stanford-Binet embodies only one and 
the same factor at different age levels is both yes and no. Much of the 
item variance can be accounted for by one factor. The one major factor, 
however, is not quite the same at different age levels, which makes the 
interpretation of IQ scores somewhat difficult. When an effort is made 
to isolate the indicators of more than one factor as Jones did, it is apparent 
that the traces of a number of other factors can be found. 

The Scoring System. The Stanford-Binet is one of the few careful attempts 
at the construction of an age scale. Although numerous other tests 
employ the mental age and the IQ, few of them meet the statistical requirements. 
A successful age scale should have a mean IQ of 100 and 
equal standard deviations of IQs at different age levels for some defined 
population of persons. In order for the standard deviations of IQ scores 
to remain constant, it is necessary for the standard deviation of mental 
age scores to increase by regular amounts from one age level to the next. 
This is an almost impossible thing to accomplish in test construction. The 
standard deviation of IQ scores on Form L varies from 20.6 at age 2½ 
to 12.5 at age 6 years. Special correction tables (29) had to be constructed 
to make IQ scores comparable at different age levels. 
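
The requirement that the spread of mental ages grow with age follows directly from the ratio definition of the IQ: among children of the same chronological age, the standard deviation of IQs is 100 times the standard deviation of mental ages divided by that chronological age. The following Python sketch works this out; the function names and the target standard deviation of 16 are illustrative assumptions, while the two Form L values are those quoted above.

```python
def required_mental_age_sd(chronological_age, iq_sd):
    """SD of mental age (in years) needed for a given SD of ratio IQs
    among children who all have the same chronological age."""
    return iq_sd * chronological_age / 100.0


def iq_sd_from_mental_age_sd(chronological_age, mental_age_sd):
    """SD of ratio IQs implied by a given SD of mental ages."""
    return 100.0 * mental_age_sd / chronological_age


# For a constant IQ standard deviation (16 is a hypothetical target),
# the spread of mental ages must grow in proportion to age.
for age in (4, 8, 12):
    print(age, required_mental_age_sd(age, 16))

# The Form L values quoted in the text imply these mental-age spreads.
print(required_mental_age_sd(2.5, 20.6))  # about 0.5 year at age 2 1/2
print(required_mental_age_sd(6, 12.5))    # 0.75 year at age 6
```

Doubling the chronological age doubles the spread of mental ages that is required, which is the regular increase the test constructor must somehow build into the items.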

The great difficulty that is met in achieving the required statistical 
properties of an age scale makes it doubtful that an age scale is worth the 
trouble. The IQ concept bears the unwarranted surplus connotations that 
the general intelligence tests assess intelligence and that a ratio scale is 
available for its measurement. An equally logical and much simpler 
procedure is to use standard scores instead of IQ units. The mental age 
concept has been clung to by clinicians and school workers because of its 
practical meaning and the ease with which it can be explained to laymen. 

The Testing of Adults. On both subjective and empirical grounds there 
is reason to believe that the Stanford-Binet is a better test for children 
than adults. One reason for this is that the concept of general intelligence 
is more meaningful with children than adults. This point will be discussed 
later in the chapter. The Stanford-Binet does not have a high enough 
“ceiling” to measure the ability of superior adults. That is, the range of 
difficulty at the adult level is not sufficient to tap the ability of highly 
gifted individuals. Because no persons over eighteen years of age were 
included in the standardization sample, the norms for adults are suspect. 
In addition, there is a logical difficulty in using the IQ concept with 
adults. Because the mental age for an individual stops growing in the late 
teens, it no longer makes sense to obtain the IQ by dividing mental age 
by chronological age. If this were done, the IQs of adults would constantly 
drop. What actually occurs in using the test is that for any 
individual who is sixteen years or over, the chronological age is considered 
to be fifteen years. This results, however, in a number of complicated 
corrections that must be made to obtain the IQ (47). Even worse than 
the statistical difficulties in assigning IQs to adults, the IQ is a meaningless 
measure with adults. For example, if a twenty-year-old person has 
an IQ of 150, this does not mean that he has the intelligence of an average 
thirty-year-old. 
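
The logical difficulty with adults can be seen numerically. In the Python sketch below, a person whose mental age has stopped at 15 is given a ratio IQ at several chronological ages, first with the uncorrected ratio and then with the chronological age held at fifteen years for anyone sixteen or over, as described above. The function names are illustrative, and the further corrections mentioned in the text (47) are not modeled.

```python
def ratio_iq(mental_age, chronological_age):
    """Uncorrected ratio IQ."""
    return 100.0 * mental_age / chronological_age


def adult_ratio_iq(mental_age, chronological_age, cutoff_age=16, divisor_age=15):
    """Ratio IQ with the chronological age treated as fifteen years
    for any individual who is sixteen years or over."""
    if chronological_age >= cutoff_age:
        divisor = divisor_age
    else:
        divisor = chronological_age
    return 100.0 * mental_age / divisor


# A mental age that stays at 15 while the person grows older.
for ca in (15, 20, 30):
    print(ca, ratio_iq(15, ca), adult_ratio_iq(15, ca))
# 15 100.0 100.0
# 20 75.0 100.0
# 30 50.0 100.0
```

The uncorrected ratio falls steadily although nothing about the person's performance has changed, which is why a fixed divisor is substituted for the actual age.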

Reliability. Very careful research was undertaken to determine the 
reliability of the Stanford-Binet IQ at different age levels (29). An 
equivalent-form reliability estimate was made separately for each age. 
Correlations were found between the scores obtained on Form L and 
Form M administered to the same subjects within one week’s time. In 
general, the findings show that the Stanford-Binet is a highly reliable 
scale, with most of the reliability coefficients equal to or greater than .90. 
Scores tend to be more reliable for persons in their teens than they are for 
young children. The studies also show that low scores are somewhat 
more reliable than high scores in each age range. In other words, a bit 
more faith can be placed in the precision of a very low score than in a 
very high score. 
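
An equivalent-form reliability estimate of the kind described is simply the correlation between scores on the two forms for the same persons. The Python sketch below computes a product-moment correlation from scratch; the eight pairs of IQs are invented for illustration and are not data from the standardization studies.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical IQs for the same eight children on the two forms,
# with the second form given within a week of the first.
form_l = [85, 92, 100, 104, 110, 118, 125, 131]
form_m = [88, 90, 103, 101, 112, 115, 128, 129]

print(round(pearson_r(form_l, form_m), 2))
```

In practice the coefficient would be computed separately for each age level, as was done in the studies cited.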

Predictive Efficiency of the Stanford-Binet. The test has shown itself 
to be a good predictor of different criteria, particularly of school grades. 
In general, the findings have been that Stanford-Binet IQs correlate in 
the neighborhood of .70 with grammar school grades, .60 with high 
school grades, and .50 with college grades (3, 42). (The decline in validity 
is probably due to the progressively decreasing dispersion of intellectual 
ability.) The following correlations were found between Form L IQ 
and high school achievement test scores (10). The number of students 
ranges from 78 to 200. 


Reading comprehension      .73     Spelling     .46 
Reading speed              .43     History      .59 
English usage              .59     Geometry     .48 
Literature acquaintance    .60     Biology      .54 


Clinical Utility. Presumably, a test like the Stanford-Binet should be 
judged, in the long run, by the success with which it predicts different 
criteria. However, many of the uses to which this and other general intelligence 
tests are put are so subtle as to make direct empirical validation 
difficult. For example, the test might be used to decide what type of 
psychotherapy should be used with a disturbed child. Because of the 
difficulty in measuring therapeutic success and the difficulty in deciding 
the importance of “intelligence” for the outcome, it is hard to determine 
how well the test works in the situation. In many situations only the 
clinical impression of how well the test performs a particular job can at 
present be used as an indication of “validity.” Judging from the wide 
acceptance of the test in clinical settings, it is apparent that the Stanford-Binet 
is judged to be as valuable as, or more valuable than, any other test 
for use with children. 


THE WECHSLER INTELLIGENCE SCALES 


David Wechsler, working at the Bellevue Psychiatric Hospital in New 
York, developed an individual test which was intended to differ from the 
Stanford-Binet in the following respects: 

1. The test items were to be more appropriate for adults, and repre- 
sentative norms for adults were to be obtained. 

2. Age levels were to be discarded in favor of a number of subtests 
which all subjects would take. 

3. Separate sets of verbal and performance tests were to be con- 
structed, allowing for both a verbal and performance IQ. 

4. The IQ was to be determined by a transformation of standard scores 
rather than through the use of mental age scores. 

5. The test would be constructed to fit clinical conceptions of “in- 
telligence” rather than in terms of “age differentiation.” 

The Wechsler-Bellevue. The first test in Wechsler’s series was presented 
in 1939 (see 52). Like the Stanford-Binet, the logic for the test is mainly 
based on the somewhat misleading concept of “general intelligence” and 
on rational rather than empirical validity. Like his predecessors in the 
measurement of general intelligence, Wechsler begins with a definition: 
“Intelligence is the aggregate or global capacity of the individual to act 
purposefully, to think rationally and to deal effectively with his environ- 
ment” (52, p. 3). Wechsler (52), as late as 1944, held to the belief that 
one common factor underlies intellectual functions and cited Spearman's 
work as support for the belief. Very few persons would share that belief 
today. 

The subtests for the Wechsler-Bellevue were selected by combing the 
test literature for tests that had worked well in practice. After some preliminary 
research, 12 of the tests were selected for special study and 
administered to over a thousand subjects. One of the tests, Cube Analysis 
(see 52), proved to be difficult to explain to subjects, and large sex differences 
were found. The other 11 tests were found to have desirable ranges 
of difficulty and practicality of administration. Brief descriptions of the 
11 subtests in the Wechsler-Bellevue are as follows: 

Verbal Scale 

1. General Information. The subject is asked 25 questions concerning 
a wide variety of facts. The questions are not intended to tap 
academic training or specialized branches of knowledge. They 
are meant to cover the kinds of information that any alert individual 
can learn from his cultural contacts. 

2. General Comprehension. The test contains 10 items concerning 
why certain social rules are necessary and how everyday problems 
are solved. 

3. Arithmetical Reasoning. Ten problems of the kind that would 
be typically encountered in elementary school arithmetic are 
given. Both speed and correctness of response are scored. 

4. Digit Span. This is the familiar memory for digits which also 
appears at different levels of the Stanford-Binet. From three to 
nine digits are read to the subject, and he is asked to repeat them 
in their exact order. In the second part of the test, the subject is 
asked to repeat the digit series backwards. 

5. Similarities. The subject is asked to tell what is similar about 
12 pairs of terms. This subtest is also very similar to material 
found on the Stanford-Binet. 

6. Vocabulary. The subject is asked the meanings of 42 words. 
Although in the original plan of the Wechsler-Bellevue the vocabulary 
test was included only as an alternate, it has worked so well 
in practice that it is usually given as a regular part of the verbal 
scale. 

Performance Scale 

7. Picture Completion. The subject is shown 15 incomplete pictures 
and asked to describe the missing part in each. This is also very 
much like material found on the Stanford-Binet. 

8. Picture Arrangement. The subject is handed a set of pictures and 
asked to arrange them in an order that tells a story. Six sets of 
pictures are given. Both speed and accuracy are scored. 

9. Object Assembly. The subject is asked to put together three jigsaw 
puzzles. Each puzzle pictures some part of the human body. 
Both speed and accuracy are scored. 

10. Block Design. The subject is shown a set of small blocks. Surfaces 
of the blocks are painted white, red, and red-and-white. 
The subject is presented with a picture of a design and asked to 
reproduce it with the blocks. Seven designs are given in turn. 
Both speed and accuracy are scored. 

11. Digit Symbol. This is an adaptation of the familiar coding test. 
The subject is given a sheet of paper on which nine symbols are 
paired with nine numbers. Farther down on the page a jumbled 
list of the numbers is given and the subject is asked to write in 
the matching symbols. 

Administration and Scoring of the Wechsler-Bellevue. As with most individually 
administered general intelligence tests, only an expert examiner 
should be trusted to give the Wechsler-Bellevue. The test is somewhat 
easier to administer than the Stanford-Binet. The subject is given each 
of the subtests in turn. The individual's raw score for a subtest indicates 
how many items he answered correctly and/or how quickly the task was 
completed. Tables are available (52) for converting raw scores on each 
subtest to transformed standard scores. The transformed standard scores 
have a standard deviation of 3 and a mean of 10. The means and standard 
deviations were obtained from a normative sample of 1,751 persons 
between the ages of seven and seventy years. 

Three separate IQs can be found from the tests. The six verbal tests are 
added together to form one IQ score. Similarly, the five performance tests 
can be added to give another IQ. Then, the two subscales can be combined 
to form one over-all IQ. The IQs are simply transformed standard 
scores. In this case, the transformation at each age level was done in such 
a way as to have the usual mean IQ of 100 and a standard deviation close 
to that of the Stanford-Binet.² Because it was found that raw scores tend 
to decline gradually in going from people of twenty-five to seventy years 
of age, the IQ is determined about the mean raw score for each age group. 
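
The two transformations described above can be sketched in Python as linear conversions of standard scores. This is a simplified illustration and not the published procedure, which relies on conversion tables (52). The function names are invented here, the IQ standard deviation of 15 is an assumed value (the text says only that it is close to that of the Stanford-Binet), and the age-group totals are made up.

```python
from statistics import mean, pstdev


def scaled_subtest_score(raw, norm_mean, norm_sd):
    """Transformed standard score with a mean of 10 and an SD of 3."""
    return 10 + 3 * (raw - norm_mean) / norm_sd


def deviation_iq(total, age_group_totals, iq_sd=15.0):
    """IQ centered at 100 about the mean total score of the person's own
    age group. iq_sd is an assumed value, not one given in the text."""
    z = (total - mean(age_group_totals)) / pstdev(age_group_totals)
    return 100 + iq_sd * z


# A raw score half a standard deviation above the normative mean.
print(scaled_subtest_score(12, norm_mean=10, norm_sd=4))  # 11.5

# The same total score referred to two hypothetical age groups.
older_group = [40, 50, 60]
younger_group = [50, 60, 70]
print(deviation_iq(50, older_group))    # 100.0
print(deviation_iq(50, younger_group))  # below 100
```

Because each person is compared only with his own age group, the same total score yields a higher IQ in a group whose average performance is lower.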

Analysis of the Wechsler. The Norms. All of the subjects in the stand- 
ardization sample were obtained from the State of New York, the majority 
coming from New York City. The 1,751 persons used in the normative 
sample were selected from 3,500 cases in such a way as to represent the 
occupational distribution of white citizens in the United States. There 
were considerably more men than women in the sample. Although the 
effort to obtain a representative sample was more systematic than that for 
many other available tests, the derived norms are likely to be somewhat 
misleading when applied to the country as a whole. 

It is an open question whether the IQ should be determined about the 
mean of each age group as is done on the Wechsler-Bellevue or about an 
over-all mean. The IQ as determined from the Wechsler-Bellevue has the 
advantage of not “penalizing” persons of middle age and over. (The relationship 
between age and intelligence will be discussed in the latter part 
of this chapter.) In the absence of definite evidence showing that “general 
ability” as well as test scores decline, the safest procedure is to center 
scores about the mean at each age level. 

² Because the standard deviation of Wechsler-Bellevue IQs is slightly smaller than 
that of the Stanford-Binet, IQs obtained from the two tests are not directly 
comparable. 

The Test Content. It is obvious from an inspection of the subtests that 
much of the material is similar to other tests and particularly to the 
Stanford-Binet. Studies of unselected adolescent and adult groups have 
generally found correlations of .80 (36, 51, 52) and higher between the 
Wechsler-Bellevue and the Stanford-Binet. The similarity in content of 
the two tests, however, does not detract from the purpose of the 
Wechsler-Bellevue. It was Wechsler’s intention to select a more proper 
range of difficulty in adult items, not necessarily to invent entirely new 
kinds of test materials. 

Davis (14) factor analyzed the 11 Wechsler-Bellevue subtests along 
with reference tests for some of the known factors of human ability. He 
used a sample of 202 eighth-grade pupils in the analysis. Davis found evi- 
dence of a number of factors in the subtests: verbal comprehension, visu- 
alization, numerical facility, mechanical knowledge, general reasoning, 
perceptual speed, and eduction of conceptual relations. The conclusion to 
be reached from this and other factor-analytic studies (1, 5) is that the 
Wechsler-Bellevue is factorially complex, apparently more so than the 
Stanford-Binet. 


This is an argument that could be solved straightforwardly if the worth 
of the test were entirely determined by how well it predicts stated criteria. 
In general, performance tests tend to be somewhat less reliable than 
the verbal tests. 


Reliability. The evidence is that the total test IQ has a reliability com- 
parable to that of the Stanford-Binet, in general being about .90. As is 
always the case, the individual subtests are less reliable. In a group of 
normal adults, retest reliability coefficients for the subtests ranged from 
.62 to .88 (15). Thus, although the total test IQ is highly reliable, marked 
: i e of the subtest scores. 

Profile Analysis. It has been a common clinical practice to use the subtests 
of the Wechsler-Bellevue in profile analysis to detect different clinical 
disorders. The underlying assumption is that different types of pathology 
manifest themselves in different score patterns. Wechsler hypothesized 
(52, p. 150), for example, that the following variations in subtest scores 
characterize the schizophrenic patient: 

Object Assembly much below Block Design 

Very low score on Similarities with high Vocabulary and Information 

Sum of scores on Picture Arrangement plus Comprehension less than 
sum of scores on Information plus Block Design 

Other profile patterns have been hypothesized for brain-damage cases 
and a number of other clinical types. 

Profile analysis is undertaken with the Wechsler-Bellevue in the situa- 
tion in which it is least warranted—insufficient reliability for the individual 
subtest and moderate to high correlations among the subtests. Sum- 
maries of the relevant research (36, 51) show that profile analysis efforts 
with the subtests have met with essentially negative results. Studies of 
the relationship between test profiles and other human attributes should 
be based on the results of well-standardized multifactor batteries, not on 
the subtests within general intelligence tests. 
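
The argument above can be made concrete with the standard formula for the reliability of a difference between two scores expressed on the same scale. The sketch below is only illustrative: the function name and the numerical values are hypothetical and are not taken from the Wechsler-Bellevue data, and the formula assumes the two scores have equal variances.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y between two equally scaled scores.

    r_xx, r_yy: reliabilities of the two scores
    r_xy:       correlation between the two scores
    Assumes equal variances (for example, standardized scores).
    """
    if not -1.0 < r_xy < 1.0:
        raise ValueError("r_xy must lie strictly between -1 and 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical pair of subtests: modest reliabilities, substantial overlap.
low = difference_reliability(0.75, 0.70, 0.50)

# Hypothetical pair from a multifactor battery: reliable and nearly independent.
high = difference_reliability(0.90, 0.90, 0.10)

print(f"overlapping subtests:  {low:.2f}")
print(f"independent measures:  {high:.2f}")
```

With moderately reliable, substantially correlated subtests the difference score is far less dependable than either subtest alone, which is the situation in which profile analysis is least warranted.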

Predictive Efficiency. Basing his test as he did on a rational approach 
to the measurement of intelligence, Wechsler (52) reports only a few 
specific correlations with performance criteria, and these are based on 
small numbers of cases. Other studies report correlations in the .40s and 
.50s between Full Scale IQ and freshman college grades (3, 42). The 
separate Verbal Scale IQ had higher correlations than Full Scale IQ with 
grades; the Performance Scale proved to be a much poorer predictor of 
freshman grades. 


THE WECHSLER ADULT INTELLIGENCE SCALE (WAIS) 


To meet the criticisms which were made of the Wechsler-Bellevue, a 
revision of the test, WAIS, was published in 1955 (54). All the original 
subtests were retained but many new items were substituted for less 
satisfactory old ones. The instructions for administering and scoring the 
test have been revised. An effort was made to obtain a wider range of 
item difficulty in each subtest, which should provide more adequate test- 
ing of persons at both extremes of the ability range. The norms for the 
WAIS are much more representative of the United States population 
than were those for the Wechsler-Bellevue. A carefully balanced sample of 
1,700 persons was tested in 24 communities spread across the country. 
The sample resembles closely the United States population in terms of 
urban and rural, men and women, geographic region, and occupation. 
The test manual (54) is more frank, clear, and complete than was that for 
the Wechsler-Bellevue. Although it will be some time before the ultimate 
worth of the test can be determined, the indication is that the WAIS will 
be a better test than the Wechsler-Bellevue. 


THE WECHSLER INTELLIGENCE SCALE FOR CHILDREN (WISC) 


Although the Wechsler-Bellevue provides norms for children down 
to the age of ten years, it is primarily an adult test. A separate test, the 
WISC (53), was constructed specifically for children aged five to fifteen 
years. The general organization of the test is much like the parent form: 

Verbal Scale                    Performance Scale 

1. General Information          6. Picture Completion 
2. General Comprehension        7. Picture Arrangement 
3. Arithmetic                   8. Block Design 
4. Similarities                 9. Object Assembly 
5. Vocabulary                  10. Coding (or Mazes) 
Alternate: Digit Span 


The subtests are similar in most cases to those in the Wechsler-Bellevue. 
The tests, however, are easier and the materials are more directly related 
to the child’s world. Digit Span can be used as an alternate to one of the 
five verbal tests. The examiner has the choice of using either Coding or 
Mazes. 


TABLE 10-2. CLASSIFICATION OF IQs ON THE WISC 
(Adapted from Wechsler, 53, p. 16) 


Description          IQ ranges        Percentage of children 

Very superior        130 and above     2.2 
Superior             120-129           6.7 
Bright               110-119          16.1 
Average              90-109           50.0 
Dull normal          80-89            16.1 
Borderline           70-79             6.7 
Mental defective     69 and below      2.2 


The WISC is a much more carefully standardized test than the 
Wechsler-Bellevue. Particular care went into the gathering of norms. One 
hundred boys and 100 girls were tested at each age level, giving a total 
of 2,200 children in the standardization sample. A strenuous effort was 
made to choose a representative cross section of white children in the 
United States. The sample was drawn from 85 communities in 11 states. 
The distribution of subjects closely resembled the country at large in 
terms of urban-rural proportion, geographical area, and parental occupa- 
tion. The WISC standardization sample is as representative as that used 
in any current general intelligence test. 

Analysis of the WISC. Because no alternate form of the test is available, 
it was necessary to use split-half reliability estimates (53). The reliability 
was carefully studied at age levels 7½, 10½, and 13½ years. The Full 
Scale reliabilities at these age levels were found to be .92, .95, and .94 
respectively. For the Verbal Scale, the reliabilities were found to be .88, 
.96, and .96 respectively. As is usually the case, the reliabilities were 
slightly lower for the Performance Scale: .86, .89, and .90. In general 
these coefficients represent good reliabilities for the three IQ scores. Here 
again the lesson is learned that high reliability of composite scores does 
not mean that subtest reliabilities are all high. Although most of the sub- 
test reliabilities are in the .60s, .70s, and .80s, some go as low as .50. This 
should be a warning to anyone who tries to consider profile differences 
among the subtest scores. It is safe enough to compare verbal IQ with 
performance IQ, and this should have diagnostic meaning for the child 
who suffers a language handicap, but no finer inspection of the differences 
among subtests can be trusted. 
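
As a rough illustration of how a split-half estimate of this kind is obtained, the sketch below correlates scores on the odd-numbered and even-numbered items and then applies the Spearman-Brown correction for a test of double length. The item data and function names are hypothetical and serve only to show the arithmetic; they are not drawn from the WISC manual. The same formula suggests why a long composite such as the Full Scale tends to be more reliable than any one short subtest.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half, factor=2.0):
    """Estimated reliability when a test is lengthened by `factor`."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability, corrected to full test length."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Hypothetical data: six examinees, six items scored 1 (pass) or 0 (fail).
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]

reliability = split_half_reliability(scores)
print(f"corrected split-half reliability: {reliability:.2f}")
```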

Several studies found correlations between the Stanford-Binet and the 
WISC ranging from the .60s to the .90s. Correlations with the Stanford- 
Binet are lower with the WISC Performance Scale than with the Verbal 
Scale. The test is too new to know what predictive validity it will have 
for different performance criteria. Few criteria are available for children 
except school grades. Considering the appearance of the test materials, 
the reliability, and the correlations with other tests, it is to be expected 
that the test will show about the same predictive efficiency for school 
grades as that afforded by the Stanford-Binet. Although the test is cur- 
rently receiving generally good response from clinical settings, it takes a 
considerable amount of experience with a new instrument to know 
whether or not it has clinical value and practicality of use. 


GROUP TESTS OF GENERAL INTELLIGENCE 


Like the individual tests of general intelligence, the group tests are usu- 
ally composed of verbal comprehension, numerical computation, and vari- 
ous mixtures of the reasoning factors. Although the tests differ from one 
another in appearance and sometimes in their factor composition, they 
tend to correlate highly with one another. At the teen-age and adult levels 
the group tests correlate highly with the individual tests, such as the 
Wechsler-Bellevue. We see the interesting result that people who started 
off in seemingly different directions to compose intelligence tests ended 
up with rather similar measures. Where the group tests differ from one 
another is in their practical advantages. Some are longer and thus more 
reliable than others. Some have obtained norms in a careful and repre- 
sentative manner; others have only scant or misleading information on 
norms. Some have either higher or lower “ceilings,” making them more 
useful with one or the other extreme of ability. Considerable research has 
been done with some of the tests, and only “face validity” can be claimed 
for others. 

The Binet and Wechsler scales dominate the field of individual tests of 
general intelligence. Among group measures neither one nor several tests 
have dominated the field. The tests, therefore, which will be discussed 
here are only examples of the measures available. 

Group Tests for Young Children. The youngest ages at which it has 
proved feasible to use group tests are the five- and six-year levels. Only 
small groups of approximately a dozen children can be tested in this way, 
and even then the examiner must exercise considerable skill to obtain the 
necessary cooperation and attention. Tests at this age level cannot employ 
written language and the child cannot be expected to write his own 
responses. Test instructions must be given orally and supported by illus- 
trations and gestures. 

One of the most widely used tests for young children is the Pintner- 
Cunningham Primary Test (34), which has been in use for over twenty 
years. The test is available in three equivalent forms, A, B, and C. Each 
form is composed of seven subtests which are added together to obtain 
one score. (Illustrative items are shown in Fig. 10-3.) 

Equivalent form reliabilities are found to be generally high for groups 
of kindergarten and first-grade children (34), ranging from .83 to .89. 
Correlations between the Pintner-Cunningham and the Stanford-Binet 
are usually about .80 (34). In a group of 260 first-grade children, the 
Pintner-Cunningham correlated .63 with scores on a reading test. Other 
group tests which are suitable for use with young children are the 
Kuhlmann-Anderson Intelligence Tests (27) and the Otis Alpha (31). 

Group Tests for the Elementary School Level. As children progress 
through the elementary school levels, more and more written material can 
be used in the tests. In the first several grades it is still necessary to rely 
on [...] and pictorial test materials. One of the oldest tests for the 
elementary school range is the National Intelligence Test (22). The 
test was developed soon after the First World War in the wake 
of the successful use of group tests with Armed Forces personnel. 

FIGURE 10-3. Illustrative items from Form A of the Pintner-Cunningham Primary 
Test. (Reproduced by permission of World Book Company.) 

Test 1. Mark the things that mother uses when she sews her apron. 
Test 3. Mark the two things that belong together. 
Test 7. Look at how each picture is drawn; make another one like it in the dots. 


The test comes in two similar parts, Scale A and Scale B. Each scale 
has four verbal and one numerical subtest. A total score is obtained by 
adding the subtests together. Equivalent forms have been constructed for 
each scale. The test relies heavily on verbal comprehension, utilizing such 
subtests as verbal analogies, opposites, and sentence completion. The 
test was used for many years as a standard instrument; but because it has 
not been revised in a long time, it is now generally being replaced by 
more recently developed tests. Other tests which are used with elementary 
school children are the Kuhlmann-Anderson Intelligence Tests (27), the 
Otis Beta (32), Lorge-Thorndike Intelligence Tests (Houghton Mifflin 
Company), Henmon-Nelson Tests of Mental Ability (Houghton Mifflin 
Company), and the California Test of Mental Maturity (45). 

Group Tests for High School Students and Average Adults. General 
intelligence tests for teen-agers and above tend to rely heavily on verbal 
comprehension and general information, with only small inclusion of 
other factors. Some of the tests are designed primarily to predict future 
achievement in school. These are usually weighted heavily with ques- 
tions on general information and look much like achievement examina- 
tions. Other tests are designed primarily to test aptitude—not what the 
individual knows but his ability to learn. The two kinds of tests, however, 
tend to correlate highly. As was previously mentioned, tests of intelli- 
gence, or aptitude, tend to rely heavily on the individual’s knowledge of 
the culture about him; thus, they are not very different from tests of 
acquired knowledge. There are exceptions: persons who score consid- 
erably lower on tests of information than on intelligence tests; and for 
these persons, the difference in scores on the two kinds of tests is of 
diagnostic importance. 

One of the oldest group tests for teen-agers and average adults is the 
Army Alpha, which did so much to popularize group tests after the First 
World War. Civilian adaptations and revisions of the test (55) are still in 
use. The Pintner, Otis, and Kuhlmann-Anderson series, mentioned in 
respect to the testing of younger children, have forms with high enough 
“ceiling” for the testing of average adults. 

One of the leading tests for this age level is the Terman-McNemar Test 
of Mental Ability (48). It has seven subtests, all of which are predomi- 
nantly concerned with verbal comprehension. (See Figure 10-4 for illus- 
trative items.) Norms were carefully determined on a sample of persons 
from 200 communities in 37 states. Conversion tables are used to express 
scores in terms of percentiles, mental ages, and deviation IQs. 

Another test that is available for use at this age level is the Army Gen- 
eral Classification Test (AGCT), which was used in the first years of the 
Second World War. The test is now available for civilian use (59). The 
test includes an equal proportion of verbal, numerical, and spatial items. 
The effort to include at least several factors in an intelligence test is a 
departure from the usual emphasis on verbal comprehension alone. 

Group Tests for College Students and Superior Adults. Tests in this 
category, like those in the next lower level, either emphasize aptitude or 
general information. As the higher levels of education and ability are 
reached, the creation of an adequate measuring device becomes more of 
a challenge to the test constructor. Generally it is much easier to compose 
adequate tests for idiots than for geniuses, and the difficulty usually 
mounts when going from the lowest to the highest extreme. General 
classification tests have been modestly successful in predicting college 
grades, with correlations in the .50s not uncommon. It has been a par- 
ticularly annoying thorn in the side of professional test builders that they 
have generally met with only small success in predicting the more creative 
side of human behavior as it is manifested in graduate school and beyond. 

FIGURE 10-4. Sample items from the Terman-McNemar Test of Mental Ability. (Re- 
produced by permission of World Book Company.) 

Test 1. Information 
Mark the word that makes the sentence TRUE. 
Our first President was: 
1. Adams 2. Washington 3. Lincoln 4. Jefferson 5. Monroe 

Test 2. Synonyms 
Mark the word which has the same or most nearly the same meaning as the first 
word. 
Correct 1. Neat 2. Fair 3. Right 4. Poor 5. Good 

Test 3. Logical Selection 
Mark the word which tells what the thing ALWAYS has or ALWAYS involves. 
A cat always has: 
1. Kittens 2. Spots 3. Milk 4. Mouse 5. Hair 

Test 4. Classification 
In each line below, four of the words belong together. Pick out the ONE WORD 
which does not belong with the others. 
1. Dog 2. Cat 3. Horse 4. Chicken 5. Cow 

Test 5. Analogies 
Hat is to head as shoe is to: 
1. Arm 2. Leg 3. Foot 4. Fit 5. Glove 

Test 6. Opposites 
Choose the word which is OPPOSITE, or most nearly opposite, in meaning to the 
beginning word of each line. 
North 1. Hot 2. East 3. West 4. Down 5. South 

Test 7. Best Answer 
Read the statement and mark the answer which you think is best. 
We should not put a burning match in the wastebasket because: 
1. Matches cost money. 2. We might need a match later. 
3. It might go out. 4. It might start a fire. 

Among the most widely used tests at this level are those which are 
designed specifically for the selection of college freshmen. One of the 
best known of these is the American Council on Education Psychological 
Examination for College Freshmen (ACE) (56, 57), which was used for 
many years but has now been discontinued. A new form of the test was 
constructed each year for use with college applicants. Three of the six 
subtests combine to form a linguistic score (L), and the other three form 
a quantitative score (Q). The L subtests consist of Verbal Analogies, 
Completion, and Same-Opposite. The Q tests consist of Arithmetical Rea- 
soning, Number Series, and Figure Analogies. Scores on parts Q and L 
were usually added together, and the composite used in the selection of 
students. The ACE has high reliability, with usual retest coefficients of 
around .90 (58). Other tests which are similar in purpose and solidity of 
construction to the ACE are the College Board Tests (60), the College 
Qualification Tests (The Psychological Corporation), and the Ohio State 
University Psychological Test, Form 21 (Ohio College Association). 

The Miller Analogies Test (30) is an interesting effort to predict suc- 
cess in graduate school and in high-level professional work. The test con- 
sists of 100 verbal analogies ranging widely in difficulty. The analogies 
are cleverly constructed in such a way as to measure both verbal compre- 
hension and reasoning ability. The major advantage of the test is that it 
has a high enough “ceiling” to distinguish among persons who would 
make much the same score on the usual intelligence test. Split-half reli- 
abilities for the test are generally found to be over .90. The Miller Analo- 
gies Test has been used frequently to predict grades in graduate school. 
The predictive efficiency of the test varies widely in terms of the school 
and the academic specialty being studied. Validity coefficients tend to be 
about .50 on the average. Another test which is both well constructed and 
capable of differentiating at high levels of ability is the CAVD (49). 

Achievement test scores in college and course grades are generally as 
valid as the intelligence tests for predicting graduate school and profes- 
sional performance (some of the achievement tests which can be used for 
this purpose will be mentioned in Chapter 12). A combination of the 
three kinds of measures by multiple regression usually makes for more 
efficient prediction. 
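
The gain from combining predictors can be sketched with the formula for the multiple correlation of a criterion with two predictors. For brevity the sketch uses two predictors rather than the three kinds of measures mentioned above, and the validity coefficients and intercorrelation are hypothetical values chosen only for illustration.

```python
def multiple_r_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion
    r_12:       correlation between the two predictors
    """
    if not -1.0 < r_12 < 1.0:
        raise ValueError("r_12 must lie strictly between -1 and 1")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)
    return r_squared ** 0.5


# Hypothetical case: an intelligence test and an achievement test each
# correlate .50 with graduate grades and .60 with each other.
combined = multiple_r_two_predictors(0.50, 0.50, 0.60)
print(f"single predictor: 0.50   combined: {combined:.2f}")
```

The improvement over a single predictor is modest when the predictors overlap heavily and grows as their intercorrelation falls.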


GENERAL INTELLIGENCE TESTS FOR INFANTS 
AND PRESCHOOL CHILDREN 


A separate section has been reserved here for the discussion of infant 
and preschool tests because of the special problems that are involved. One 
of the pioneers in the testing of infants is Arnold Gesell. For over 20 years 
he and his colleagues have performed longitudinal studies of child devel- 
opment. A group of 107 infants was systematically observed at 4, 6, and 
8 weeks and at every 4-week interval to 56 weeks. The children were 
studied again at 18 months and at the ages of 2, 3, 4, 5, and 6 years. On 
the basis of these observations the Gesell Developmental Schedules (16) 
were prepared. They are intended to measure the following attributes: 

1. Motor behavior. How well the child can hold his balance, coordi- 
nate, stand, walk, and manipulate objects. 

2. Adaptive behavior. How well the child can solve the problems of his 
small world: obtain objects, remove obstacles, solve puzzles, and 
react to stimuli. 

3. Language behavior. How well the child can communicate, using the 
word in its broadest meaning, including the use of gestures and 
primitive words, to the later development of real language. 

4. Personal-social behavior. How well the child learns habits of per- 
sonal care such as toilet-training, dressing, and feeding himself. At 
a later age consideration is given to how the child manages himself 
in social situations and in play activity. 


During the first year of life, when the Gesell scales would supposedly 
have their unique value, most of the observations have to be made about 
motor behavior. The 4-week old infant cannot, of course, talk or follow 
oral instructions of any kind. The most that can be done at the infant 
stage is to watch the child’s spontaneous movements and note how he 
reacts to various stimuli. At 1.4 months the average child can coordinate 
his eyes on an object held before him. At 3 months he will make reach- 
ing movements for an object. At 5.5 months he will react differently to 
strangers than to his parents. (See Figure 10-5 for illustrative test mate- 
rials.) Other tests for infants are the Cattell Infant Intelligence Scale 
(11), the California First-Year Mental Scale (7), and the Northwestern 
Infant Intelligence Tests (18). 

FIGURE 10-5. Test materials used with the Gesell Developmental Schedules. (Cour- 
tesy of The Psychological Corporation.) 

Analysis of Infant Tests. Tests for infants are difficult to standardize, 
administer, and score. They are, of course, all individual measures. Infant 
tests are less reliable than tests for older children. The reliability is con- 
siderably lower during the first six months than afterward. Several studies 
of different tests have found reliabilities around .65 for testing during the 
first six months (6, 11). After 6 months the reliabilities move up to 
respectable figures in the .80 to .90 range (6, 11). Except for the first 
weeks and months, the infant tests do measure something consistently. 
The question is, what do they measure? A real difficulty in validating 
infant scales is that there are almost no criteria available until the child 
enters school. A customary procedure has been to correlate infant tests 
with scores made several years later on more established intelligence tests 
like the Stanford-Binet. Studies (8) have shown that infant tests given at 
the age of one year or less correlate about zero with intelligence tests 
given 5, 10, and 15 years later to the same persons. It is obvious that 
infant tests do not measure intelligence as it is customarily measured in 
older children. The key to this dilemma seems to be that the infant scales 
primarily measure motor and sensory ability, and the research on relia- 
bility shows that there is some consistency in the development of these 
attributes. It is quite likely that the infant tests would predict motor and 
sensory skills later in life, but interestingly enough, almost nothing has 
been done to test this hypothesis. 

Preschool Tests. Between the ages of 2 and 5 the developing intellectual 
processes become accessible to psychological tests. After the child de- 
velops speech, can manipulate objects, and becomes acquainted with the 
world about him, he can be tested with some of the materials that are 
customarily used in intelligence tests. However, many of the difficulties 
in test standardization and administration still remain. The test materials 
must be largely pictorial or consist of performance problems. 

The difficulty in testing infants is that they usually do little one way or 
the other to indicate the status of their intellectual abilities. Children 
between the ages of 2 and 5 do too much. They are so active and dis- 
tractable that it is difficult to carry on any formal testing procedures. 
Many children in this age group are shy with strangers and will give little 
if any cooperation to the examiner. They are often not highly motivated to 
impress the examiner or themselves with how well they can perform. 
Consequently, the test must be posed as an interesting game to the child, 
and much depends on the examiner’s skill. 

One of the most prominent tests for young children is the Minnesota 
Preschool Scale (20, 21). There are two equivalent forms, each with 
26 items. Some of the items are as follows (see Figure 10-6 for an illus- 
tration of some of the testing materials): 

1. Pointing to parts of the body on a doll 

2. Telling what a picture is about 

3. Naming colors 

4. Digit span 

5. Naming objects from memory 

6. Vocabulary 

7. Copying simple geometrical designs 

8. Block building 

9. Jigsaw puzzle 

10. Indicating missing part in pictures 

FIGURE 10-6. Test materials used in the Minnesota Preschool Scale. (Reproduced by 
permission of Educational Test Bureau, Minneapolis.) 

Many of the items are similar to those at the lower age levels of the 
Stanford-Binet. The instrument is largely a power test with no emphasis 
on speed, and the items are little concerned with motor skills. Tests at this 
age level which depend on speed and motor skills are probably poorer 
measures. 

The Minnesota scale was standardized on a group of 900 children rang- 
ing in age from 1½ to 6 years. Equivalent-form reliabilities of the total 
scale vary from .80 to .94 (20, 21). There are some reasons to believe that 
the Minnesota scale is not an entirely adequate measure below the age 
of 3. Although scores for children above 3 tend to correlate highly with 
Stanford-Binet scores obtained later, the correlation for children below 
the age of 3 is only .21 (21). Also, clinical experience indicates that some 
of the test materials are not sufficiently interesting to hold the attention of 
children below the age of 3. Two other preschool tests are the Intelli- 
gence Test for Young Children (50) and the Merrill-Palmer Scale (44). 


PERFORMANCE AND “CULTURE FAIR” TESTS 


The customary measures of intelligence have been criticized for their 
emphasis on verbal comprehension. With a test like the Stanford-Binet, 
there is no way, for example, of measuring the ability of the deaf child. 
The child must be able to hear the examiner's instructions and respond to 
what is said. The child who has a speech difficulty is also at a disadvan- 
tage. He will have difficulty in responding even if he knows the answers. 
The child who is new to this country and its language may be classified as 
feeble-minded by the Stanford-Binet yet have above average ability when 
examined in his native language. 

Performance Tests. Tests of the performance type were developed to 
circumvent the traditional emphasis on verbal comprehension. In a per- 
formance test the subject does not necessarily have to read, write, speak, 
or understand spoken language (the instructions can be given in panto- 
mime). One of the most widely used tests of this type is the Pintner- 
Paterson Performance Scale (35). The scale contains 15 items, but the 
10 of these which are considered the best are used in most practical work. 
Brief descriptions of the 10 items are as follows: 

1. Mare and Foal. A picture is presented showing a mare and a foal 
in a country setting. Some cut-out pieces are removed from the 
picture. The subject must place the pieces in their proper places. 

2. Seguin Form Board. A wooden board is presented with 10 cut- 
out geometrical forms. The subject is asked to place the forms in 
their proper places. 

3. Five-figure Form Board. Similar to the Seguin Form Board but 
more difficult because each geometrical form is divided into sev- 
eral parts. 

4. Two-figure Form Board. Another form board which proves more 
difficult for most subjects than the Five-figure Form Board. 

5. Casuist Form Board. This form board also uses geometrical de- 
signs which are divided into several pieces. It is made difficult 
because of the close similarity between some parts of the cut-out 
geometrical forms. 

6. Manikin. A wooden figure of a man is assembled from cut-out 
arms, legs, head, and body (see Figure 10-7). 

7. Feature Profile. Wooden pieces are used to assemble the profile 
of a human face (see Figure 10-8). 

8. Ship Test. A jigsaw puzzle of a ship is put together (see Fig- 
ure 10-9). 

9. Healy Picture Completion, I. Cut-out sections are removed from 
a picture of children playing. The subject is asked to place the 
cut-out sections in their proper places. 

10. Knox Cube Test. A test of immediate memory involving [...] 
wooden cubes. The examiner taps the cubes in a sequence and 
asks the subject to do likewise. 

FIGURE 10-7. Manikin Test. (From Pintner-Paterson Performance Scale; courtesy of 
C. H. Stoelting Company.) 

FIGURE 10-8. Knox-Kempf Feature Profile Test: Pintner-Paterson Modification. 
(Courtesy of C. H. Stoelting Company.) 

FIGURE 10-9. Ship Picture Form Board. (From Pintner-Paterson Performance Scale; 
courtesy of C. H. Stoelting Company.) 

Credit is given on the items for both speed and accuracy. Many of the 
performance tests are so easy for the average person that only through 
the “speeding” of performance is any dispersion of scores obtained. The 
emphasis on speed is probably a weakness of the Pintner-Paterson and 
other similar performance tests. Speed of response is conditioned by age, 
subculture, and personality. Although the point is not settled, it is doubt- 
ful that a heavy emphasis on speed in general intelligence tests leads to 
good measures. 


The Pintner-Paterson is composed almost entirely of jigsaw puzzles in 
one form or another. The test standardization and the obtaining of norms 
are crude in comparison to tests like the Stanford-Binet. The reliability 
of the test is considerably lower than that for most “verbal” tests. Corre- 
lations between the Pintner-Paterson and traditional intelligence tests 
are usually low. Other performance scales are the Arthur Point Scale 
(4) and the performance scales on the Wechsler tests. Both of these per- 
formance scales are generally more satisfactory instruments than the 
Pintner-Paterson. 

Analysis of Performance Tests. The seemingly good idea of construct- 
ing performance measures of general ability has not borne fruit. The tests 
tend to be too much alike and more in the nature of incidental games than 
measures of important functions. In general the scales have lower reli- 
ability and less validity for the available criteria than do the traditional 
“verbal” tests. There is little reason to believe that performance tests are 
“better” to use with children in general. The remaining use for these 
measures is in testing the individual with a language handicap of some 
kind. If there is a marked difference in performance by such persons 
between performance tests and “verbal” tests, it is an important diagnostic 
clue. Fewer research funds and much less energy have gone into the con- 
struction of performance scales than into “verbal” intelligence tests. Per- 
haps a more strenuous research effort would produce more useful per- 
formance tests. 

“Culture Fair” Tests. Objections have been made not only to the empha- 
sis on language in traditional intelligence tests but to the inclusions of 
cultural artifacts of all kinds. For example, the Zuni Indian child or a 
child reared in the Tennessee mountains might have difficulty in identify- 
ing a locomotive, interpreting a picture about sea life, or answering ques- 
tions about how city people live. The advent of radio, television, and other 
arms of the mass media of communication has tended to provide all 
people with a common set of cultural symbols and information. However, 
the opportunity to absorb cultural information is a matter of degree. 
Everyone is to some extent isolated in a subculture of his family, friends, 
religion, and geographical environment. Living in some subcultures rather 
than others makes a difference in the ability to score well on traditional 


226 Tests and Measurements 


intelligence tests. The person in the crossroads of our cultur , the urban, 
highly “Americanized” individual, is given the most adv si oe 

In order to have measures which are “culture free or “c iae 2 
tests have been composed which not only make very little ew baag 
but are little dependent on the symbols and information of the ee 
The items on such tests are generally composed of geometrical se 
simple pictures, and abstract symbols. One of the better-kno a -a o 
this kind is the Progressive Matrices Test developed in Engla n i iy" oe: 
(37). The test consists of 60 designs, or matrices, each of which has a a i 
out section. The subject is shown six to eight alternative cut out pieces 
and asked to indicate which should be placed in the matrix (see Fig- 
ure 10-10 for illustrations of the items). Raven has collected extensive 
standardization data on the test in England. Rimoldi (38 ) obtained rather 
similar norms on children in Argentina, suggesting that the test is, to ie 
extent, “culture fair.” High correlations have been found between be 
Progressive Matrices and “verbal” tests. A factor-analytic study ( ») 
shows that the test measures some components of the reasoning factors, 
most likely emphasizing eduction of relationships, plus elements of = 
spatial factors in some items. The Progressive Matrices gives me br 
being an important instrument for testing persons with a langua ge i 5 
cap and for cross-cultural comparisons. Other “culture fair” tc sts are e 
Semantic Test of Intelligence (41), the Culture-free Test of Intelligence 
(12), and the Leiter International Performance Scale (28). i 

The “culture fair” tests have not been studied sufficiently to estimate
their eventual worth. The critical question is whether the type of items
being used can measure the broad functions which underlie human abilities.
The performance tests have overspecialized in the measurement of
speed, spatial factors, and motor skills. The “culture fair” tests have
apparently been able to tap a broader range of functions, particularly
some of the reasoning abilities. However, it is questionable whether a test
which deemphasizes verbal comprehension will be ultimately the most
valid. Traditional “verbal” measures are still better predictors of the
available criteria. It may be that as performance in real-life situations
beyond school is studied more fully the “culture fair” type of test will
come into its own.

It would seem that an even more valid test could be developed if some
way were known for measuring aptitude for verbal comprehension as
distinct from the status of verbal comprehension at a particular time. The
only way that is presently available for testing verbal comprehension is
through vocabulary tests and measures of reading skill. It is doubtful that
an individual would appear “intelligent” or be able to accomplish
intellectual goals if he could not eventually, through training and wider
acquaintance with the culture, come to master the language and its usage.


FIGURE 10-10. Two items from the Raven Progressive Matrices Test. (Reproduced by
permission of J. C. Raven.)


THE NATURE OF “GENERAL INTELLIGENCE”


It has been shown that in spite of the separable factors that underlie 
tests of human ability there is a common ground among ability functions 
that can be reliably and usefully measured. The “verbal” intelligence tests, 
both individual and group, generally correlate highly enough with one 
another for us to speak of their common characteristics. If a marked
difference in test scores is found between two kinds of people with one of the
tests, the difference would most likely be reflected in the others. The wide
use of intelligence tests during the last half-century has shown a number 
of things about the underlying process: 

1. Intelligence cannot at present be measured in children below the 
age of two years and not very well below the age of five or six years. 
Earlier in the chapter the difficulties of constructing tests for infants and 
preschool children were discussed. Perhaps the next decade will show an 
improvement in the early measurement of general ability. 

2. The concept of “general intelligence” is more meaningful with children
than adults. Although the evidence on this point is somewhat conflicting,
it seems that abilities are more “general” in children. Comparative
factor analyses of children and adults tend to show that there is more of a
tendency toward one general factor in children. Some of the factors which
are found in adult populations are difficult to find at all in children. This
may be due in part to the fact that different test materials have to be
used with children than with adults. However, the weight of the argument
is that abilities are more “general” in children. The factorial
diversity of abilities in adults is probably due to different life experiences
and different kinds of school and vocational training. It makes more sense
to use general intelligence tests with children than adults.

3. Intelligence as measured by current “verbal” tests is partly due to
heredity and partly due to environment. The most telling arguments for
this position come from the studies of resemblance between family members
in intelligence. Conrad and Jones (13) administered intelligence tests
to over 200 families in rural New England. They found that for children
above the age of five years intelligence of parents and children correlates
.49. Numerous other studies have found correlations very close to .50.
Conrad and Jones found a correlation of .49 between siblings (between
brothers, between sisters, or between brothers and sisters). Roberts (40)
pointed out that a correlation of .50 between siblings is what would be
expected from multifactor inheritance. It is possible that the
correlations found between family members could be due to factors other
than heredity. Family members tend to share the same environment, talk
about topics in common, and have similar kinds of schooling. Environment
may explain part of the resemblance, but studies of twins (see 2) make it
apparent that this is not a complete explanation. Correlations between
the intelligence test scores of fraternal twins (dizygotic) are usually
higher than between siblings, generally ranging from .50 to .70.
Correlations between identical twins (monozygotic) are usually around .90,
almost as high as the reliability of the tests! Fraternal twins can have
very different genetic structures, but the genetic structures of identical
twins are exactly alike. This leaves little doubt that at least a portion,
and apparently a sizable portion, of intelligence is due to inheritance.

4. After the age of about six years, the individual’s intelligence tends to
remain stable in respect to his age group. That is, the superior people at
one age level tend to be the superior people at other age levels (see Figure
10-11 for evidence on this point). The relationship is far from perfect,
and isolated individuals may show drastic changes over a period of years.

FIGURE 10-11. Effect of age at initial testing and test-retest interval on
prediction of later Stanford-Binet IQ from earlier test. (Adapted from
Honzik, McFarlane, and Allen, 23.)

5. From the test results obtained at one period in time, intelligence
increases with the age of the persons tested up to the late teens. The
intelligence level gradually decreases throughout adulthood and old age (see
Figure 10-12). This has been erroneously interpreted to mean that each
person’s intelligence grows and then declines. This is not at all what
Figure 10-12 shows. It shows the scores of different persons of different
ages tested at about the same time. There are a number of different ways
of explaining the trend in addition to the one that older people decline in
intelligence. The older people probably had less schooling, were reared
in different times and different environments, and reacted differently to
psychological tests (particularly to highly “speeded” tests). It will be a
long time before the growth and decline of intelligence for individuals
will be determined. Nevertheless, it is important to note that older people
generally make lower scores than their younger contemporaries. Whether
the same finding will be obtained ten or twenty years from now is an
interesting speculation.

FIGURE 10-12. Age changes in full-scale standard scores on the
Wechsler-Bellevue Intelligence Scale. (Adapted from D. Wechsler, 52, p. 29.)

6. There are definite group differences in intelligence test scores. Lower 
scores on the average are made by people of low socioeconomic status, 
people living in rural areas, people living in the Southern or Southwestern 
part of the United States, immigrants from southern Europe, and Indians 
and Negroes. The interpretation of differences of this kind places a 
large strain on the logical foundation of intelligence tests. As was
discussed previously in this chapter, the traditional measures of intelligence
are constructed in such a way as to favor certain groups. The question is
whether or not the subgroups which tend to make lower scores would 
make high scores if afforded the advantages of the wider culture. There 
is some evidence to show that they would. It was found (25) that South- 
ern Negro children who migrate to New York make higher scores the 
longer they are in the city environment. Another point that should be con- 
sidered is that even though some subgroups score lower on the average 
than others, at least some high-scoring individuals can be found in each. 
The whole question of ethnic and socioeconomic differences is highly
charged with emotion in both professional and lay circles and is a point
about which much more information needs to be obtained.

7. Many different kinds of attainment involve intelligence as measured
by traditional “verbal” instruments. In particular, intelligence is one of
the major factors in successful school work. Numerous correlations of .50
and above between particular tests and school grades were cited in this
chapter. Intelligence test scores differentiate occupational groups (see
Table 10-3). However, it should be noted that there is considerable overlap
between most groups. Comparing the extremes in Table 10-3, the top
10 per cent of lumberjacks score higher than the lower 10 per cent of
accountants. Scores on intelligence tests are predictive of success on
many but not all jobs (see Table 10-4). The fact that the correlations
between test scores and job success are near zero in some cases does not
necessarily mean that intelligence is not important for the job. The
dispersion of intellectual ability usually is narrowed considerably in most
jobs by the individual's gravitating toward a job at which he can work
comfortably and by the selection procedures that are used in industrial
settings. Also, it is important to note that the variability of intelligence
test scores is higher for lower-level than for higher-level jobs. This
indicates that, as would be expected, intelligence is a more important
determiner of success in high-level occupations. The individual's ability to
succeed is determined by his intelligence and by a host of other things as
well: abilities not measured by intelligence tests, interests, personality
traits, and just plain luck.


TABLE 10-3. AGCT STANDARD SCORES OF OCCUPATIONAL GROUPS
IN THE SECOND WORLD WAR
(Adapted from N. Stewart, 43)

                                    Percentile
Occupational group        10     25     50     75     90
Accountant               114    121    129    136    143
Teacher                  110    117    124    132    140
Lawyer                   112    118    124    132    141
Bookkeeper, general      108    114    122    129    138
Chief clerk              107    114    122    131    141
Draftsman                 99    109    120    127    137
Postal clerk             100    109    119    126    136
Clerk, general            97    108    117    125    133
Radio repairman           97    108    117    125    136
Salesman                  94    107    115    125    133
Store manager             91    104    115    124    133
Toolmaker                 92    101    112    123    129
Stock clerk               85     99    110    120    127
Machinist                 86     99    110    120    127
Policeman                 86     96    109    118    128
Electrician               83     96    109    118    124
Meatcutter                80     94    108    117    126
Sheet metalworker         82     95    107    117    126
Machine operator          77     89    103    114    123
Automobile mechanic       75     89    102    114    122
Carpenter, general        73     86    101    113    123
Baker                     69     83     99    113    123
Truck driver, heavy       71     83     98    111    120
Cook                      67     79     96    111    120
Laborer                   65     76     93    108    119
Barber                    66     79     93    109    120
Miner                     67            87    103    119
Farm worker               61     70     86    103
Lumberjack                60     70     85    100


TABLE 10-4. MEDIAN VALIDITY COEFFICIENTS OF INTELLIGENCE TESTS
FOR VARIOUS OCCUPATIONAL GROUPS IN THE PREDICTION OF JOB PROFICIENCY
(Adapted from E. E. Ghiselli and C. W. Brown, 17, p. 577)

Occupational            Median validity    Number of
group                   coefficient        validity coefficients
Clerical workers             .35                 85
Supervisors                  .40                  9
Salesmen                     .33                  4
Sales clerks                -.09                 18
Protective service           .25                  6
Skilled workers              .55                  6
Semiskilled workers          .20                 45
Unskilled workers            .08                 13
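
The remark in point 7 that a narrowed dispersion of ability can pull a
validity coefficient toward zero can be illustrated with a small simulation.
The sketch below is only an illustration under stated assumptions: the
population validity of .50, the ability band kept on the job, the sample
size, and the function names are all invented for the example and do not
come from the studies cited in the text.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_range_restriction(n=20000, validity=0.5, band=(0.5, 1.0), seed=0):
    """Compare the test-criterion correlation in a full applicant pool with
    the correlation among people whose scores fall in a narrow band.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Test scores in standard-score form for the whole population.
    scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Job success depends partly on the test score and partly on other things.
    noise_sd = math.sqrt(1.0 - validity ** 2)
    success = [validity * s + rng.gauss(0.0, noise_sd) for s in scores]

    full_r = pearson_r(scores, success)

    # Keep only persons whose scores lie inside the band found on the job.
    kept = [(s, y) for s, y in zip(scores, success) if band[0] <= s <= band[1]]
    restricted_r = pearson_r([s for s, _ in kept], [y for _, y in kept])
    return full_r, restricted_r


if __name__ == "__main__":
    full_r, restricted_r = simulate_range_restriction()
    print(f"correlation in the full population:  {full_r:.2f}")
    print(f"correlation within the narrow band:  {restricted_r:.2f}")
```

In this sketch the relation between test score and success is identical for
everyone; only the spread of scores differs between the two groups, which is
enough to make the second correlation much smaller than the first.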
CHAPTER 11


Special Abilities 


There are many human attributes to measure other than the kinds of
abilities which were discussed in the previous two chapters. There we
talked about the components of intelligence or intellectual ability. These
are usually thought of as the “higher processes,” the most prized of
abilities. However, individual differences are as easily found, and sometimes
as important to study, in sensory, motor, mechanical, and artistic abilities.


In order to understand fully an individual’s potentialities and liabilities,
much must be learned about him in addition to his intellectual capabilities.
Two persons could make the same score on a general intelligence test
and yet be very different in other important ways. Similarly, for all of the
factors which were described in Chapter 9, two persons could have exactly
the same profile of scores and differ importantly in terms of other abilities.
One person might be underweight and frail, the other an excellent physical
specimen. This would make quite a difference if they were both being
considered for jobs as arctic explorers, test pilots, or steeple jacks. One
person might have a flair for mechanical work, whereas the other could
have little ability. This would be important to know if they were both
considering careers as mechanical engineers. There are many, many other
ways in which the two persons could differ importantly: in auditory
acuity, depth perception, finger coordination, musical ability, sense of
balance, athletic ability, resistance to disease, and so on.

It would be futile to try to list even a sizable fraction of the special
abilities that are touched on in psychological work. Instead, a number of
abilities will be described which are important concerns in school
placement, vocational counseling, and personnel selection.


SENSORY TESTS 


The first tests in psychology, those of Galton and his immediate
followers, were largely measures of sensory ability: the fineness of sensory
thresholds, reaction time, and sensory acuity. The early investigators
became discouraged with these measures because they failed to predict
school grades, teachers’ ratings of intelligence, and other criteria of
intellectual accomplishment. However, sensory tests have come back in the
last several decades as important measures in their own right. Attention
will be given here to some components of vision and audition, omitting
the interesting individual differences to be found among the remainder of
the “five senses”: touch, smell, and taste. Indeed, the range of individual
differences extends to numerous other senses that are seldom listed. There
are internal senses like thirst, sensitivity to movement of the limbs, and
awareness of specific organic tensions.


VISION


The popular practice of talking about “good” and “poor” eyesight does 
considerable injustice to the complexity of visual functions. There are a 
number of separable and only partially related kinds of “good” vision. A 
primary distinction must be made between near acuity and far acuity. 
Near acuity concerns how well the individual can discern visual forms 
within 1 or 2 feet of his eyes. Far acuity concerns how well the individual 
can discern visual forms placed 20 or more feet away. A third component 
of “good” vision is depth perception, the ability to judge the proximity of 
objects to one another. Another component is the ability to distinguish 
colors. Although it is commonplace to think of “color blindness” as a 
unitary characteristic, there are different kinds of color blindness. Also, 
the ability to distinguish colors is partly a matter of degree rather than 
an all-or-none attribute. 

Wall Charts. The most familiar measure of visual ability is the ordinary 
wall chart, which nearly everyone has encountered in applying for a 
driver’s license or taking a physical examination. The Snellen Chart is 
used extensively for this purpose. It consists of rows of letters, each lower 
row containing smaller letters than the one above it. The chart is placed 
20 feet from the subject. If he can read the row of letters that the average 
man can, he is said to have 20/20 vision. If he can read the row of letters 
that the average person can read at only 15 feet from the chart, he is said 
to have 20/15 vision. Similarly, if he needs to stand 20 feet from the chart 
to read the row that the average person can read from 40 feet, he is said 
to have 20/40 vision. 
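
The arithmetic behind the Snellen notation can be written out in a few
lines. The following is a minimal sketch, not part of any standard scoring
procedure: the function name is invented for the example, and the decimal
ratio it returns is an added convenience that the text itself does not use.

```python
def snellen_fraction(test_distance_ft, normal_distance_ft):
    """Return the Snellen notation and its value as a decimal ratio.

    test_distance_ft   -- distance at which the subject stands (20 feet for
                          the wall chart described in the text)
    normal_distance_ft -- distance at which the average person can read the
                          smallest row the subject reads correctly
    """
    if test_distance_ft <= 0 or normal_distance_ft <= 0:
        raise ValueError("distances must be positive")
    notation = f"{test_distance_ft:g}/{normal_distance_ft:g}"
    ratio = test_distance_ft / normal_distance_ft
    return notation, ratio


if __name__ == "__main__":
    # The three cases mentioned in the text.
    for normal_distance in (20, 15, 40):
        notation, ratio = snellen_fraction(20, normal_distance)
        print(f"{notation} vision, ratio {ratio:.2f}")
```

A ratio above 1 corresponds to better-than-average far acuity (20/15), and a
ratio below 1 to poorer-than-average far acuity (20/40).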

Although the Snellen Chart is an adequate device for detecting gross 
deficiencies in visual acuity, it has a number of disadvantages. Like all of 
the wall charts, it tests only far acuity. A school child could have excellent 
far acuity and still have crippling visual defects of other kinds. Some 
alphabetical letters are easier to distinguish than others, and this is not 
taken into account when using the Snellen Chart. Also, the rows of letters 
are easy to remember, and the test can often be “faked” by the person
who has some prior knowledge of the chart. The amount of light on wall
charts should be carefully controlled, but in much practical work this is
given little consideration. When controlled conditions are obtained, the
reliabilities of the Snellen and other wall charts are satisfactorily high.
One study (51) reports a reliability coefficient of .88 for the Snellen
Chart.

Color Vision. One of the oldest tests for color vision is the Holmgren 
Woolens. The subject is given different colors of yarn and asked to sort 
the ones that are alike. It is a crude test which serves only to distinguish 
persons who are very deficient in color vision. A more systematic measure 
can be obtained with the Ishihara color plates.* The plates are composed
of small patches of color. The person who has good color vision can see 
a number on the plate. The color-deficient person either does not see the 
number or sees a different number. More recent color vision tests are the 
Farnsworth Dichotomous Test for Color Blindness (15), the Farnsworth- 
Munsell 100 Hue Test (16), and the Illuminant-Stable Color Vision Test 
(19). In order to keep color vision tests adequately standardized they 
must be used in the same illumination and protected from fading or 
soilage. 

Multiple-component Tests of Vision. In recent years devices have been 
constructed which test a number of different aspects of vision. The three 
best-known instruments are the Ortho-Rater (Bausch and Lomb), the 
Sight-Screener (American Optical Company), and the Telebinocular 
(Keystone View Company). Each of these instruments tests near vision, 
far vision, depth perception, color discrimination, and control of the eye 
muscles. In general, the multiple-component instruments are a consid- 
erable advance over the older, single-component tests. 


AUDITION 


The sense of hearing is also composed of a number of different functions.
Only auditory acuity, the ability to detect faint sounds, will be
treated here. Auditory acuity is itself complex: the person who can hear
well at one tone level may be nearly deaf at higher or lower frequencies.
Because of their relevance to musical aptitude, some of the other auditory
functions will be considered in a later section.

The older tests of auditory acuity employed sound sources like whispered
speech or the ticking of a clock. In the whispered speech test, the
examiner stands some distance from the subject and whispers a number
of words. The subject tries to say what each word is in turn. The examiner
walks farther and farther from the subject to determine the distance at
which the whispered words can be heard. Although tests of this kind are
adequate for detecting gross losses of hearing, they have a number of
defects. It is difficult to standardize both the loudness and clarity of
whispered speech. One examiner will inevitably whisper a bit louder
and/or clearer than another in spite of the best efforts at standardization.
Such tests measure auditory acuity within a narrow range of the tone, or
frequency, continuum. The person who can hear whispered speech well
might not be able to hear sound at a different frequency, such as the
sound of a ticking clock. The difference in the acoustical properties of
testing rooms and the problem of ruling out extraneous noises add to the
difficulty of standardizing this type of test.

* Obtainable from C. H. Stoelting Company.


FIGURE 11-1. Audiogram of a child with severe high-tone deafness. (Adapted from
L. A. Watson and T. Tolan, 48.)

A number of instruments have been developed for measuring auditory
acuity at different points on the frequency continuum. These are called 
pure-tone audiometers (see 11, 48). Earphones are used to test one ear 
at a time. The standard procedure is to gradually raise the sound intensity 
until the subject indicates that he can hear the tone. Then, starting with 
a sound that the subject can hear well, the intensity is lowered to a point 
where it can no longer be heard. (You will remember this as one of the
psychophysical methods for determining thresholds, the method of limits.)
The procedure is repeated at different frequency levels. The resulting 
data can be plotted as a profile of auditory acuity (see Figure 11-1). 
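
The ascending and descending series just described can be sketched in code.
What follows is a simplified illustration rather than the procedure of any
particular audiometer: the simulated listener, the 5-unit intensity step,
the intensity range, and the practice of averaging the two crossover points
are all assumptions made for the example.

```python
def ascending_limit(hears, start, stop, step):
    """Raise the intensity from `start` until the tone is first reported."""
    level = start
    while level <= stop:
        if hears(level):
            return level
        level += step
    return None  # never heard within the range tested


def descending_limit(hears, start, stop, step):
    """Lower the intensity from `start` until the tone is no longer reported."""
    level = start
    while level >= stop:
        if not hears(level):
            return level
        level -= step
    return None  # still heard at the lowest level tested


def method_of_limits(hears, low, high, step):
    """Estimate a threshold as the mean of one ascending and one descending
    crossover point (a common simplification of the method of limits)."""
    up = ascending_limit(hears, low, high, step)
    down = descending_limit(hears, high, low, step)
    if up is None or down is None:
        raise ValueError("threshold lies outside the range tested")
    return (up + down) / 2


if __name__ == "__main__":
    # Hypothetical listener: true thresholds (arbitrary intensity units)
    # at three test frequencies, with a marked loss at the highest tone.
    true_thresholds = {250: 12, 1000: 18, 4000: 62}
    audiogram = {}
    for frequency, threshold in true_thresholds.items():
        listener = lambda level, t=threshold: level >= t
        audiogram[frequency] = method_of_limits(listener, 0, 80, 5)
    for frequency, estimate in audiogram.items():
        print(f"{frequency} cycles per second: threshold about {estimate}")
```

Repeating the estimate at each frequency yields the kind of profile plotted
in an audiogram. The estimate can be no finer than the step size, which is
one reason an individual retest is preferable when a loss is suspected.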

Pure-tone audiometers are now available for group testing. Earphones 
are given to all of the subjects. Standard answer sheets are used for the 
subjects to indicate whether or not they hear the tone at different intensi- 
ties. It is not possible to determine the individual’s auditory acuity as 
finely in the group testing situation. The individual who shows a marked 
loss in any frequency range on the group test should, if possible, be given
the individual test to determine more accurately the nature of his hearing
deficiency.


PRACTICAL USES FOR SENSORY TESTS 


The individual with a visual or hearing defect is at a definite disadvan- 
tage in many performance situations. Particularly, the young school child 
is handicapped. It is sometimes found that poor vision or hearing is 
responsible for apparent dullness. The child who cannot see the black- 
board or hear the teacher will learn very little regardless of his latent 
capacity for learning. Reading difficulties can often be traced to visual 
defects. In order to speak correctly the child must imitate the speech 
of others. He cannot imitate what he cannot hear. To do well on most 
psychological tests, the child must be able to hear the instructions and to 
read the test content. As is common practice in many school programs, 
children should be systematically tested at different age levels for audi- 
tory and visual ability. 

Even the sophisticated adult is often unaware that he has a visual or 
auditory defect. It often happens that individuals who wear glasses for 
the first time are surprised that the world “looks that way.” Persons with 
poor far acuity often take it for granted that everyone sees objects more
than 50 feet away as “big blurs.” Consequently, the individual cannot be
relied upon to detect his own sensory difficulty and seek help. He is likely 
to blame his inability to perform well on “dumbness” rather than on a 
visual or hearing loss. 

Sensory ability is necessary for many different occupations. The classic 
example is the baseball umpire’s dependence on “good eyesight.” Color 
discrimination is paramount to the interior decorator, the tailor, and the 
artist. A moderate level of auditory acuity is necessary for most jobs, 
particularly if it is required to talk with others or to follow spoken instruc- 
tions. However, there are few jobs in which high-level sensory ability is 
the primary attribute. Even jobs that would seem to depend heavily on 
sensory acuity usually require only average ability. A typical example is 
that of a sonar operator, who uses a sound echoing device to detect hostile 
submarines. It was found that up to a certain point auditory acuity is a 
“must” for the job; but above the level of average hearing, auditory acuity 
is not a predictor of good and poor sonar operators. 

Sensory disabilities are often prominently involved in adjustment problems.
It is not uncommon for the person who is partially cut off from his
environment because of a sensory disability to become withdrawn,
depressed, and resentful. Sensory tests can be used to detect difficulties in
audition and vision. Correction or compensation for the sensory disability
often leads to an improvement in the individual's personal relations.


MOTOR DEXTERITY 


It has long been recognized that the person who works well with his
“head” does not necessarily work well with his hands. Accomplishments
like shaping a fine piece of pottery, hitting a home run, and operating
complex machinery have little to do with intelligence or with formal
school training.

Among the oldest motor tests are the peg boards, designed to measure
arm, hand, and finger dexterity. A typical example is the Stromberg
Dexterity Test (46; see Figure 11-2). The first part of the test requires the
subject to place 60 cylindrical blocks into holes as fast as he can. In the
second part, the blocks are removed, turned over, and put back in the
holes. Another widely used test is the Crawford Small Parts Dexterity
Test (10; see Figure 11-3). In the first part of the test, the subject uses
tweezers to place pins in holes and then places a small collar over each
pin. In the second part, small screws are put in place with a screwdriver.

FIGURE 11-2. Stromberg Dexterity Test. (Courtesy of The Psychological
Corporation.)

Some tests are designed specifically to test how well the individual can 
work with tools and small mechanical parts. A typical test of this kind is 
the Bennett Hand-tool Dexterity Test (4; see Figure 11-4). The test 
requires the subject to remove and replace nuts and bolts as quickly as 
possible. 

More complex tests involving hand, arm, and leg coordination have
been designed for particular jobs. One of the best known of these is the
Complex Coordination Test (36) used for the selection of pilots by the
Air Force (see Figure 11-5). The test is a partial replica of an airplane
cockpit, complete with stick and rudder. Lights on a control panel simulate
the maneuvers of an airplane. The subject must use stick and rudder
to match the stimulus lights, the counterpart in the test of coordinating
stick and rudder as required by the situation in an airplane.

FIGURE 11-3. Crawford Small Parts Dexterity Test. (Courtesy of The
Psychological Corporation.)

Analysis of Motor Tests. Most tests of motor dexterity are highly
dependent on speed. Consequently, they prove to be better predictors of
jobs in which speed rather than quality is important. There are many jobs
in which speed is only a minor consideration. The person who can saw a
board quickly does not necessarily have the craftsmanship of the skilled
cabinetmaker.

Motor dexterity tests have acceptably high reliabilities, usually over .80
and sometimes over .90. An important characteristic of motor tests is that
they tend to correlate very little with one another. Slightly different
manipulations of the same material often have little in common. For
example, the two parts of the Crawford Small Parts Dexterity Test correlate
on the average less than .50 (10). A correlation of only .57 was found
between the two parts of the Stromberg Dexterity Test (46). Correlations
between different motor tests prove to be even smaller. Factor-analytic
studies of motor dexterity tests have generally found few broad common
factors. Tests in this area are characterized by high specificity.

FIGURE 11-4. Bennett Hand-tool Dexterity Test. (Courtesy of The Psychological
Corporation.)

Because of the small overlap between motor dexterity tests, there are 
no general measures of motor ability such as the general intelligence tests 
supply for intellectual functions. Motor tests are then of relatively little 
use in vocational counseling. They are most legitimately used in industrial 
selection where the job is simple, requires a definite set of motor skills, 
and is highly dependent on speed. Among jobs of this kind are those in 
production line work, sewing machine operation, and packaging.

Motor dexterity tests show at best only moderate predictive validity
for most situations in which they are used (see Tables 11-1, 11-3, 11-4,
and 11-5). However, if they are used in conjunction with other ability
tests, they often add a small, but important, increment to the over-all
validity of the battery. Motor tests tend to be more valid when they are
made to resemble the actual machine or instrument which is featured on
the job. Tests designed in this way are called job miniatures. If the job is
that of lathe operator in a machine shop, the best motor test would
employ a miniature lathe with the same kinds of dials, handles, and
controls that appear on the real lathe. During the Second World War, the
Army Air Force used a variety of motor tests for the selection of pilot
trainees. The Complex Coordination Test, which resembles most closely
what the pilot actually does, generally proved to be one of the most
valid instruments.

FIGURE 11-5. Complex Coordination Test. (Courtesy of the U.S. Air Force.)
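
The idea that a motor test can add a small increment to the validity of a
battery can be made concrete with the standard formula for the multiple
correlation of a criterion with two predictors. The sketch below is an
illustration only. The two validities are patterned loosely on the
skilled-trade column of Table 11-1, the intercorrelation of .10 between the
tests is an assumed value, and the function name is invented.

```python
import math


def multiple_r_two_predictors(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2 -- validity of each predictor taken alone
    r12    -- correlation between the two predictors
    Uses R^2 = (r1^2 + r2^2 - 2*r1*r2*r12) / (1 - r12^2).
    """
    if abs(r12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)


if __name__ == "__main__":
    intelligence_validity = 0.45   # general intelligence test alone
    dexterity_validity = 0.21      # finger dexterity test alone
    intercorrelation = 0.10        # assumed; not taken from the text

    battery = multiple_r_two_predictors(
        intelligence_validity, dexterity_validity, intercorrelation
    )
    print(f"validity of the two-test battery: {battery:.2f}")
    print(f"gain over the intelligence test:  {battery - intelligence_validity:.2f}")
```

Because motor tests correlate so little with other ability tests, most of
their modest validity is not redundant with what the other tests already
supply, which is why the battery gains even from a weak predictor.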


MECHANICAL APTITUDE 


Mechanical ability is popularly thought of as concerning the making
and fixing of things as distinct from clerical, sales, administrative, and
professional abilities. In general, we speak of mechanical ability in
relation to trades and various levels of skilled work. There is no fine dividing
line between mechanical occupations and those that are not mechanical.
Some occupations that most of us would classify as mechanical are
plumber, carpenter, automobile mechanic, and boat builder.

There is no one type of test function which underlies mechanical work
to the same extent that the general intelligence tests relate to school work.
In order satisfactorily to predict a particular mechanical job, a range of
different kinds of tests must be used in a battery. Different combinations
of tests are usually needed for different jobs. Some of the kinds of tests
that have proved useful in the prediction of mechanical work and in
vocational guidance are described in the following sections.

Intellectual Ability. Because an individual is involved in making and
fixing things, it does not mean that intelligence is an unimportant
attribute. When it is possible to do so, either a battery of the major
intellectual factors or at least a general intelligence test should be tried as a
predictor. The spatial and perceptual factors are very useful in predicting
many mechanical jobs. Some tests embodying these functions will be
considered in the subsequent discussion. However, it is important also to
consider the verbal, numerical, and reasoning factors in the prediction
of mechanical work. Because the “verbal” intelligence tests are
mainly composed of these factors, a good general intelligence test is often
one of the best predictors of job success (see Tables 11-1, 11-2, 11-3,
11-4, and 11-5).


TABLE 11-1. CORRELATIONS OF ABILITY TEST SCORES WITH MEASURES OF JOB
PERFORMANCE
(Adapted from E. E. Ghiselli, 20)

                                       Type of job
Test                    Clerical  Protective service  Skilled trade  Semi-skilled  Unskilled

General intelligence    .36  .28  .45  .20  .16
Arithmetic              .42  -.12  3  .15
Number comparison       .28  .25  [illegible]  .15  .15
Spatial relations       .06  [illegible]  .45  .30  .21
Mechanical principles   [illegible]  [illegible]  .45  .25
Finger dexterity        .22  .20  .21  .30  .05


General intelligence tests tend to be more predictive of how well the
individual does in job training than how well he performs subsequently
on the job. This is probably because the training phase requires more
abstract ability. In many cases the training program involves classroom-like
procedures, reading of materials, and the learning of machine operations.
These are the kinds of things that intelligence tests predict best.
Predictor tests usually correlate higher with performance in training than
with later job performance because, as a rule, performance is more reliably
measured in the former than in the latter. Progress in training is
usually graded more carefully. There is more of an opportunity to observe
the worker, and, in many cases, tests are used to assess progress in training.


TABLE 11-2. MINIMUM MENTAL AGES FOR SEVERAL JOBS
(From A. S. Beckman, 3)

Mental
age    Boys                                  Girls

 5     Dishwasher                            Sewer (simple patterns)
       Vegetable parer
 6     Mixer of cement                       Mangle operator
       Freight handler                       Crocheter (open mesh)
 7     Painter (rough work)                  Cross stitcher
       Shoe repairer (simple tasks)          Hand-iron operator
 8     Haircutting and shaving               Scarf-loom operator
       Gardener                              Dressmaker (not including pattern work)
 9     Foot-power printing-press operator    Fancy-basket maker
       Mattress and pillow maker             Cook (simpler dishes)
10     Sign painter                          Sweater-machine operator
       Painter (shellacking and varnishing)  Launderer
11     Storekeeper                           Librarian’s assistant
       Greenhouse attendant                  Power sealer in cannery


General intelligence tests tend to be more predictive of success in
high-skill rather than low-skill jobs. That is, validities are usually higher
[...] and complex machine operators than they are for jobs like truck
drivers and furniture movers. The difference is probably due to the
increased importance of abstract ability in more highly skilled work.
In selecting people for unskilled [...] set up minimum standards of
intelligence rather [...] high intelligence (see Table 11-2).

[...] a typical job situation he is lying under the automobile and must remove
a nut from the engine above him. The nut is slanted at a 45-degree angle,
and he must remove it with a wrench that has two joints. The mechanic
must orient himself spatially to the complex of angles and movements in
order to do such work. In draftsmanship, it is necessary to portray
three-dimensional objects on two-dimensional pieces of paper. In some
drawings the objects must be shown in tilted positions or partially
disassembled. It takes spatial ability, either spatial orientation or
visualization, to work as a draftsman and at many other jobs.


FIGURE 11-6. Sample items from the Revised Minnesota Paper Form Board.
For each item, the subject must choose the figure which would result if the
pieces in the first section were assembled. (Courtesy of The Psychological
Corporation.)


One of the best known spatial tests for mechanical aptitude is the 
Minnesota Paper Form Board Test (33, 39). It is a useful predictor of 
grades in shop courses, supervisors’ ratings of workmanship, objective pro- 
duction records, and many other measures of mechanical performance 
(see Figure 11-6). 

Perceptual ability is required in a variety of jobs. The perceptual speed
factor has been used most often as a predictor, but it is likely that the 
other perceptual factors will eventually find their place in vocational 
guidance and job selection. The individual who sits by a fast-moving 
conveyor belt and looks for flaws in manufactured products uses per- 
ceptual ability. Any job in which it is necessary to detect aspects of a 
visual scene requires perceptual ability to some extent. 

Examples of perceptual tests can be found in some of the multifactor
batteries which were discussed in Chapter 9. Some tests designed
specifically for industrial selection will be described in the section on
clerical aptitude. (See Tables 11-1, 11-3, 11-4, and 11-5 for typical
validities.)

TABLE 11-5. AVERAGE VALIDITY COEFFICIENTS OF VARIOUS APTITUDE TESTS
FOR PROTECTIVE OCCUPATIONS, SERVICE OCCUPATIONS,
AND VEHICLE OPERATORS
(From Ghiselli and Brown, 21, p. 229)

                            Protective       Service          Vehicle
                            occupations      occupations      operators
                            Train-  Profi-   Train-  Profi-   Train-  Profi-
                            ing     ciency   ing     ciency   ing     ciency
Intellectual:
Intelligence ............. 6 26 50 07 18 14 
Immediate memory ...... .28 26 
Arithmetid t essi sodas. 30 .08 59 = (14 14 04 
Number comparison ...... Bee 25 ao 14 
Name comparison ........ oe 36 T =i 
Cancellation ............ as E Raye —.27 
Spatial and perceptual: 
Tocatton ai «tata as <i ot Ben aes aiis 18 
Spatial relations ......... 33 04 42 sje 21 
Speed of perception ...... 30 cals RA vee 08 
Mechanical principles ....| 41 .20 Sate eae 36 21 
Motor: 
TAPP se eet 
POULIN Neve A tee i 
Finger dexterity ......... 19 
Hand dexterity .......... ae TA Ss =,09 
Arm dexterity eshe os. « one ies Sere = 01 
Simple reaction time ..... os Se ore iste eee 21 
Complex reaction ........ aor ae axe ae ces 35 


Motor Dexterity. The kinds of motor tests which were discussed earlier
in the chapter are often featured in mechanical aptitude batteries. Motor
tests were placed in a separate section because they are involved in a
number of aptitudes other than mechanical aptitude, athletic aptitudes,
for example (see Tables 11-1, 11-3, 11-4, and 11-5 for typical validities).

Mechanical Comprehension. Among the most successful tests of
mechanical aptitude are those designed to measure the mastery of
mechanical principles, or the ability to reason with mechanical problems.
In a typical problem, a truck driver is rushing medical supplies to a
fire-damaged town. He discovers that his truck is about an inch too tall to
clear a bridge leading to the town. What should the truck driver do? The
answer is to let some air out of the tires. As another example, a motorist
must remove a boulder which blocks the road. He finds a long, stout
pole to do the job. He must then decide whether to use the pole as a pry
or a lever. He can construct a lever by balancing the pole on a rock
placed between the boulder and himself (the lever exerts more force
than a pry). After deciding to use the lever action with the rock as a
balancer (a fulcrum), he must then decide where to place the balancer
(rock) and how to best exert his strength against the pole. As a third
example, a hoist is being built to lift tree trunks into the bed of a truck.
A system of gears and chains is set up to transfer the power from an
electric motor to the hoist. It must be decided how large the different
gears should be to give the desired power to the hoist.


FIGURE 11-7. Sample items from the Bennett Mechanical Comprehension Test,
Form AA. (Courtesy of The Psychological Corporation.) The sample items ask:
“Which room has more of an echo?” and “Which would be the better shears for
cutting metal?”


The three problems above are typical of those found in mechanical
comprehension tests. Tests of this kind tend to range over several ability
factors. Because of the paucity of factor-analytic studies of mechanical
comprehension tests, it is not always possible to say just what a particular
test measures. Some of them emphasize the spatial factors, which are
prominently involved in many tests of mechanical aptitude. Other functions
which appear in some mechanical comprehension tests are numerical
computation, various aspects of the reasoning factors, and familiarity
with tools and machinery. Examples of mechanical comprehension tests
are the Bennett Mechanical Comprehension Test (5, 6) (see Figure
11-7), the Mechanical Reasoning Test of the DAT (discussed in Chapter 9),
and the SRA Mechanical Aptitude Test (40).

Mechanical Information. One of the most useful measures for the selection
of skilled and semiskilled workers is a test of information, or knowledge,
about tools and machinery. For example, a set of questions like the
following would be useful in the selection of automobile mechanics:

1. What is a torque converter?

2. What is a ratchet?

3. Where is the “needle valve” in an automobile?

4. What source of power is used to run the generator in most automobiles?

5. How do you recognize “pre-ignition”?

Information tests can be constructed either to measure general knowledge
of mechanical work or to measure knowledge of one particular job.
It is usually the case that a test constructed specifically for one job, such
as for television repair men, will be more predictive than a test of
knowledge in general about mechanical work. However, the test which
is constructed specifically for one job is likely to be useful only for
selecting personnel for that job or for closely related jobs. Also, because of
the specific knowledge which the instrument measures, it is usually of little
use in vocational guidance, where it is usually necessary to measure wide
functions rather than highly specialized information. A more general
measure of mechanical knowledge is often useful in vocational guidance.

Analysis of Mechanical Aptitude Tests. A sufficient personnel-selection 
program usually requires a careful study of the particular industrial 
setting. The diversity of psychological functions which is required by 
different jobs makes it necessary to try out a range of tests to find the 
ones that will work well in practice. Also, it is often necessary to invent 
and construct tests for particular jobs.

Few of the mechanical aptitude tests have been studied as extensively 
as the tests of intellectual ability. Consequently, it is usually necessary to 
perform considerable research in the job setting to determine the utility 
of particular tests. In few cases have norms for the tests been obtained 
on a sufficiently representative sample to use them as dependable guides.
It is generally more meaningful to obtain local norms for particular per- 
sonnel selection or school programs. 

Mechanical aptitude tests are fairly easy to standardize and s[...]
reliable. Reliabilities in many cases approach those of the better general
classification tests. Mechanical aptitude tests have modest validity for
many different jobs, the amount varying considerably with the job. The
seemingly small validities for some jobs should be regarded from a number
of standpoints. Primarily there is the possibility that formal abilities
such as those measured in psychological tests have little to do with job
success. Another possibility is that the criterion of job success, the
assessment, is unreliable and consequently cannot be predicted. If the
assessment is determined only from the sketchy impressions of foremen
and managers, it is seldom very reliable. It is meaningful in this instance
to make the correction for attenuation as discussed in Chapter 6,
correcting for the unreliability of the assessment. Because of the
unreliability inherent in most job assessments, mechanical aptitude tests
are often considerably more valid than is apparent from the correlation
coefficients (this was discussed more fully in Chapter 6). The third point
to consider is that the modest to low individual validities should not
obscure the fact that a combination of several tests in a battery will often
produce reasonably good predictive efficiency.
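
The last two points can be made concrete with a short computation. The
sketch below is illustrative only: the function names and all numerical
values are hypothetical and are not taken from the tables in this chapter.
The first function applies the correction for attenuation for an unreliable
criterion, dividing the observed validity by the square root of the criterion
reliability. The second gives the multiple correlation obtained when two
tests are combined with optimal weights.

```python
import math


def correct_for_criterion_unreliability(r_xy: float, r_yy: float) -> float:
    """Validity expected if the criterion were measured without error.

    r_xy is the observed test-criterion correlation; r_yy is the
    reliability of the criterion (for example, of foremen's ratings).
    """
    if not 0.0 < r_yy <= 1.0:
        raise ValueError("criterion reliability must lie in (0, 1]")
    return r_xy / math.sqrt(r_yy)


def two_test_multiple_r(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two optimally weighted tests.

    r1 and r2 are the validities of the two tests; r12 is the
    correlation between the tests themselves.
    """
    if abs(r12) >= 1.0:
        raise ValueError("the two tests must not be perfectly correlated")
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    return math.sqrt(r_squared)


# Hypothetical case: observed validity .30 against ratings of reliability .50.
corrected = correct_for_criterion_unreliability(0.30, 0.50)

# Hypothetical battery: validities .30 and .25, tests correlating .20.
battery = two_test_multiple_r(0.30, 0.25, 0.20)

print(round(corrected, 2), round(battery, 2))
```

Under these assumed figures the corrected validity is about .42, and the
two-test battery reaches about .36, higher than either test alone.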


CLERICAL AND STENOGRAPHIC APTITUDES 


Clerical Aptitude. Clerical aptitude, as it will be discussed here, con- 
cerns the office clerk, the individual who deals with files, ledgers, ac- 
counts, and correspondence. The term clerk is much more general than 
this particular usage, referring variously to grocery clerk, department 
store clerk, and even court clerk. Perceptual speed tests have a special 
importance in the prediction of clerical performance. Perceptual speed 
is involved in chores like proofreading letters, searching for particular 
accounts in a long list, and alphabetizing names. 

A typical test which emphasizes perceptual speed is the Minnesota
Clerical Test (1). The test is divided into two separately timed parts,
Number Comparison and Name Comparison (see Figure 11-8). Retest
reliabilities range from .85 to .91. Extensive validation research shows
correlations ranging up to .60 between the Minnesota Clerical Test and
business school and job performance criteria. Other perceptual speed
tests for clerical aptitude are the Clerical Speed and Accuracy Test of
the DAT, the Clerical Perception Test on the GATB, and parts of the
General Clerical Test (50).

FIGURE 11-8. Sample items from the Minnesota Clerical Test. (Courtesy of The
Psychological Corporation.)

If the two names or numbers of a pair are exactly alike, make a check mark
on the line between them.

66273894 ______ 66273984
527384578 ______ 527384578
New York World ______ New York World
Cargill Grain Co. ______ Cargil Grain Co.

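
The comparison items of the kind shown in Figure 11-8 can be keyed
mechanically, since the correct response depends only on whether the two
entries match exactly. The following is a minimal sketch, not the
publisher's scoring procedure: the function names, the response sheet, and
the simple count-of-agreements score are all assumptions made for
illustration.

```python
def build_key(pairs):
    """Return True for each pair whose two entries match character for character."""
    return [left == right for left, right in pairs]


def count_correct(responses, key):
    """Count items on which the subject's response agrees with the key.

    A response of True stands for a check mark, False for no mark.
    This scoring rule is an assumption made for this sketch.
    """
    return sum(1 for given, wanted in zip(responses, key) if given == wanted)


# The four sample pairs shown in the figure.
sample_pairs = [
    ("66273894", "66273984"),
    ("527384578", "527384578"),
    ("New York World", "New York World"),
    ("Cargill Grain Co.", "Cargil Grain Co."),
]

key = build_key(sample_pairs)

# Hypothetical subject who wrongly checks the last pair.
hypothetical_responses = [False, True, True, True]
score = count_correct(hypothetical_responses, key)

print(key, score)
```
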
Perceptual speed is a necessary component of many clerical jobs, but
other functions should be tested as well. Verbal comprehension is a
desirable attribute, especially a knowledge of spelling and grammar.
Clerical work often involves routine arithmetical operations, and
therefore a numerical computation test is likely to be a useful selection
instrument. If the clerical job requires the use of accounting machines or other
equipment, some of the motor dexterity tests may be of use. The DAT 
tests not only perceptual speed but a variety of verbal and arithmetic 
abilities; hence, it probably would serve well in the selection of office 
clerks. The General Clerical Test (50) is a short battery designed to 
cover the major functions required in clerical work. Scores on nine sub- 
tests are combined to form clerical, numerical, and verbal scores. 
Stenographic Ability. The selection of stenographers is best made on 
the basis of specific job requirements. Typing and shorthand ability are 
the major requirements in most stenographic jobs. Therefore, proficiency 
tests in these skills offer a sound basis for the selection of stenographers 
(for typical tests, see 7, 8, 47). Here as in many other testing problems, 
the needs of a selection program are not always the same as those of the 
vocational guidance situation. In vocational guidance, before the indi- 
vidual has had an opportunity to test himself in specialized training or 
on the job, some prediction must be made as to how well he will perform. 
Although an insufficient amount of research has been done on the aptitude
for stenographic work to allow us to speak with certainty, the most
promising attributes seem to be the motor skills involved in typing,
measures of verbal comprehension and language usage, and interest in
stenographic work.


ARTISTIC APTITUDES 


The nature of art and artistic ability have been matters of interest to 
psychologists for well over a hundred years, dating back to and before 
Fechner. In spite of this long interest, the measurement of artistic ability 
lags behind the testing of other ability functions. This is due in part to 
practical considerations. There has always been a more urgent need for 
intellectual and vocational tests than for tests of artistic ability. Research 
on classifying men in the Armed Forces, testing children in school, and 
selecting men in industry, has won financial support because of the 
immediate gains to be expected. Although the study of artistic ability 
offers some practical advantages, it never has promised a sufficient com- 
mercial market to attract large numbers of psychologists. 

Another reason that tests of artistic ability lag behind is the intrinsic 
complexity of the functions to be measured. In this area, it is very difficult
to distinguish aptitude from achievement. The accomplished musician 
or painter can be judged by what he currently does. But it is difficult to 
find the underlying aptitudes that give one child an advantage over 
another in reaching eventual artistic accomplishment. 

Good art is largely a matter of time and place. Chinese music sounds
cacophonous to us, and no doubt much of our music seems strange to the
Chinese. Some primitive music centers almost entirely on the drum and
other percussion instruments. Complex rhythmic patterns used are too
elusive for the “civilized” ear. We would miss the esthetic appreciation
that the primitive has for his music as much as he would be baffled by the
symphony orchestra. The delicate sensitivities of a Japanese poem are lost
on an Occidental audience. We might cite many other examples to show
that art is a matter of values. Different people have different values, and
values change over the ages.

Different abilities are involved in the production of art and in the
appreciation of art. The music critic may not be a musician at all; the art
historian may have never painted. Different abilities are required in the
production of different kinds of art work.

The measurement of artistic aptitude evolves into several components. 
For producing works of art there are probably some underlying abilities 
that cut across different times and different cultures. In graphic art work, 
the ability to make line drawings, to combine colors, and to achieve prop- 
erties of “balance” are required in most paintings. In musical ability there 
are the basic sensory skills of tonal memory, sense of pitch, and recogni- 
tion of rhythms, which to some extent cut across different kinds of musical 
production.

Another attribute which can be tested is the appreciation of art forms. 
Appreciation is dependent on the values in a particular culture and on the 
individual’s knowledge and acceptance of those values. Finally, tests can
be made of how well the individual can produce particular art forms; 
such achievement is dependent both on his initial aptitude and the train- 
ing that he has had. 


MUSICAL APTITUDE 


Seashore Measures. One of the oldest and most widely used musical 
tests is the Seashore Measures of Musical Talents (42). The test stimuli 
are reproduced on phonograph records, which can be used for the testing 
of moderate-sized groups of subjects. The battery includes the following 
subtests: 

1. Pitch discrimination. The subject is asked whether the second of
   two tones is higher or lower than the first. The items are made
   progressively more difficult by decreasing the difference in pitch
   between the pairs of tones.

2. Loudness discrimination. The subject judges which of two tones
   is louder.

3. Time discrimination. One tone is presented for a longer period
   of time than another. The subject judges which of the two tones
   is longer.

4. Rhythm judgment. The subject judges whether two rhythmic patterns
   are the same or different.

5. Timbre judgment. The subject judges whether or not two tones
   are of the same musical quality.

6. Tonal memory. Two series of notes are played. In the second
   series one of the notes is altered. The subject judges which of the
   notes is different.

Scores on the subtests correlate near zero on the average with intelli- 
gence tests (17). The subtest scores are partly independent, with median 
intercorrelations ranging from .48 to .25 for different samples (17). Split- 
half reliability estimates for the subtests range from .62 to .88. Rhythm 
and timbre are the least reliable. If, as is often done, the six subtests are 
added to form one general measure, high reliability can be expected for 
the test. Except for large differences in scores, the subtests are not suffi- 
ciently reliable for considering differential aptitudes within the test. 
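
Why a sum of moderately reliable subtests can be highly reliable is shown by
the sketch below. It assumes standardized subtest scores, equal weights, and
uncorrelated errors of measurement, so that the error variance of the sum is
the sum of the subtest error variances. The function name is hypothetical,
and the six reliabilities and the mean intercorrelation of .35 are
illustrative values chosen to fall within the ranges quoted above; they are
not the published figures for the individual subtests.

```python
def composite_reliability(subtest_reliabilities, mean_intercorrelation):
    """Reliability of an unweighted sum of standardized subtests.

    Total variance of the sum of k standardized scores is
    k + k(k - 1) * mean intercorrelation; error variance is the sum
    of (1 - reliability) over the subtests.
    """
    k = len(subtest_reliabilities)
    if k < 2:
        raise ValueError("at least two subtests are needed")
    error_variance = sum(1.0 - r for r in subtest_reliabilities)
    total_variance = k + k * (k - 1) * mean_intercorrelation
    return 1.0 - error_variance / total_variance


# Hypothetical reliabilities spanning the reported range of .62 to .88.
reliabilities = [0.62, 0.70, 0.75, 0.80, 0.85, 0.88]
composite = composite_reliability(reliabilities, 0.35)

print(round(composite, 2))
```

With these assumed values the composite reliability is about .92, above the
reliability of any single subtest.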

Scores on the Seashore test are affected very little by age. Similar norms 
are found for grammar school, high school, and adult populations. Al- 
though the research results are somewhat contradictory (17, 41, 44), it 
seems that scores are affected only slightly by musical training. These two 
findings taken together suggest that the Seashore subtests measure some 
basic aptitudinal functions which are possibly inherited. The larger ques- 
tion is whether the aptitudinal functions involved in the tests are of any 
importance in predicting musical accomplishment. 

An insufficient amount of research has been done with the Seashore test 
to speak with firmness about its predictive utility. Modest to small
correlations have been found with grades in music classes and with teachers’
ratings of musical ability (12, 24, 31, 38). The test differentiates
moderately well between students who complete specialized musical training
and those who drop out. It is reasonable to think that at the level of spe- 
cialized music training most of the persons with poor ability in the Sea- 
shore type of measures will have already been eliminated. 

A number of persons have argued that the Seashore measures are not 
very similar to the skills that are involved in the actual production of 
music. The Seashore subtests measure certain types of sensory discrimination
which might be necessary for musical ability but not sufficient. Where
the Seashore test would have its most important value is in helping 
parents decide whether their children would be likely to profit from 
extensive musical training. This would save considerable money and 
would keep the neighbors from having to hear little Susan grind away for 
years at an instrument she will never master. At present the Seashore test 
is difficult to administer below the age of ten, and the predictive validity 
of the test at younger ages is not known. 




Wing Test. The Wing Standardized Tests of Musical Intelligence (49) 
were designed to stay as close as possible to the skills involved in musical 
production and appreciation. Like the Seashore test, the Wing test uses 
phonograph recordings. The following seven functions are tested: 

1. Chord analysis. Judging the number of notes in a chord.
2. Pitch change. Judging the direction of change of notes in a
   repeated chord.
3. Memory. Judging which note is changed in a repeated melodic
   phrase.
4. Rhythmic accent. Judging which performance of a musical phrase
   has the better rhythmic pattern.
5. Harmony. Judging which of two harmonies is better for a particular
   melody.
6. Intensity. Judging which of two pieces has the more appropriate
   pattern of dynamics, or emphasis.
7. Phrasing. Judging which of two versions has the more appropriate
   phrasing.

The first three parts measure complex sensory abilities. The other four
concern the esthetic value of different compositions. The subtest scores
are added to form one general measure of musical aptitude.

The Wing test has received favorable response from teachers of music, 
who feel that the test covers many of the skills that are important in 
musical training. Little is known about how well the test can predict 
available criteria. The author (49) reports correlations of .60 and above 
between the test and teachers’ ratings of musical ability in three small 
groups. It is possible that the Wing test will prove to be a better differ- 
entiator of musical talent at higher levels of ability than the Seashore. 
The Wing test might then be useful in the guidance and selection of stu- 
dents who want to go on from some initial musical training to more 
advanced training.

There are a number of other tests based on phonographically recorded 
tones and musical phrases. The Drake Musical Memory Test (13, 14) 
emphasizes the memory component, which appears in only one subtest 
of the Seashore and Wing batteries. The Drake memory test is different 
in that the items concern short musical melodies instead of groups of 
tones. The subject must determine whether two melodies are the same, 
and if not, whether the change has been made in the key, the timing, or 
in specific notes. The Kwalwasser-Dykema Music Tests (30) comprise a
battery of 10 subtests. Six of the subtests are similar to those on the Sea- 
shore. The subtests cover much the same ground as the Seashore plus 
the ability to read musical notation and some components of musical 
appreciation. 


Analysis of Musical Aptitude Tests. Not enough research has been done
to say how well the current tests work. A particular problem is the dearth
of adequate criteria of musical accomplishment. School grades in the
history, techniques, and general knowledge of music are the most reliable
indices. But these are not the same as artistry in musical production.
Judgment of the actual mastery of musical instruments must necessarily
be based on the impressions of teachers and other persons, and
impressions of this kind usually have only modest reliability. Even if there are
some difficulties in validating the instruments, much more research 
should be done to determine how well they work. 

The tests which were discussed in the previous sections are all, strictly 
speaking, tests of appreciation. That is, the subject is not required actually 
to play an instrument but only to listen and judge what he hears. How- 
ever, some of the complex judgments involved seem to underlie the skills 
that are needed in musical production. It is likely that other types of 
tests could be used in conjunction with the conventional measures to 
obtain a better estimate of musical ability. Motor skills are involved in 
playing most musical instruments, the piano being an outstanding exam- 
ple. Motor tests might be profitably used in the prediction of musical 
accomplishment. Although intelligence tests correlate very little with the 
available musical tests, this does not mean that they would be of no use 
in predicting musical accomplishment. It would be expected that intelli- 
gence and, more generally, the factors which underlie differential apti- 
tude tests would be useful in the prediction of course grades in musical 
curricula and in special music schools. 

It is likely that an individual’s interest in musical work will be as pre- 
dictive of later success as tests of the ability type. Two such interest tests 
are the Farnsworth Scales (18) and the Seashore-Hevner Tests for Atti- 
tude toward Music (43). The small amount of research that has been 
done indicates some promise for tests of musical interest. 


GRAPHIC ART 


McAdory Test. The field of graphic art testing has been dominated by 
a particular type of item, in which a “masterpiece” is compared with one 
or more altered versions of the same work. One of the oldest tests of this 
kind is the McAdory Art Test (34), which came out in 1929. The test 
contains pictures of 72 art works covering a wide variety of contemporary 
art forms, ranging from pictures of furniture and automobiles to works of 
art in museums. Four versions of each art work are given; these differ in 
shape, arrangement, shading, and use of color. The subject is required to 
rank-order the four versions in terms of his preferences. 

Items for the McAdory test were selected in terms of the judgments of
experts, including teachers, critics, and artists. Items were retained only
if at least 64 per cent of the judges agreed on the ranking of the four ver- 
sions of each picture. A primary weakness of the test is its dependence on 
contemporary art values. For example, it is likely that the preferences for 
furniture, automobile design, and even paintings have changed since the 
test was constructed. 
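
The retention rule can be sketched in a few lines. The source does not say
how agreement among the judges was computed, so the sketch assumes one
plausible reading: agreement is the share of judges who gave the single most
common complete ranking of the four versions. The function names and the
judges' rankings are hypothetical.

```python
from collections import Counter


def ranking_agreement(rankings):
    """Share of judges who gave the single most common complete ranking."""
    if not rankings:
        raise ValueError("at least one judge's ranking is needed")
    counts = Counter(tuple(ranking) for ranking in rankings)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(rankings)


def retain_item(rankings, threshold=0.64):
    """Apply the 64 per cent rule under the assumed definition of agreement."""
    return ranking_agreement(rankings) >= threshold


# Hypothetical item on which 7 of 10 judges give the same ranking.
item_a = [("A", "B", "C", "D")] * 7 + [("B", "A", "C", "D")] * 3

# Hypothetical item on which only 6 of 10 judges agree.
item_b = [("A", "B", "C", "D")] * 6 + [("A", "C", "B", "D")] * 4

print(retain_item(item_a), retain_item(item_b))
```

Under this reading the first item would be retained and the second dropped.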


FIGURE 11-9. Illustrative items from Meier Art Judgment Test. (Reproduced by
permission of Norman C. Meier.)


Meier Test. The Meier Art Judgment Test (35) is by far the most 
widely used test of art appreciation. It also uses the altered-version type 
of item. The test differs from the McAdory in that only one alternative 
version is given for each original art work, and the items concern rela- 
tively timeless art masterpieces. The items are all in black and white. 
The altered version of each masterpiece is meant to destroy the esthetic 
organization. In a typical altered version, one figure is moved to the [...]
in such a way as to change the balance of the painting (see Figure 11-9).

The initial selection of items was made on the basis of expert judgments.
Items on which there was high agreement among 25 experts were
retained. The items were further pared down in terms of internal
consistency statistics. Only those items showing a high correlation with the
total score were placed in the final form.

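
Item selection by internal consistency can be sketched as follows. Each item
is correlated with the total score, and items falling below a cutoff are
dropped. The data, the cutoff of .50, and the function names are
hypothetical; the source does not report the cutoff actually used. The total
here includes the item itself, which somewhat inflates each correlation.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # an item everyone passes or fails carries no information
    return cov / math.sqrt(var_x * var_y)


def item_total_correlations(responses):
    """responses[s][i] is 1 if subject s chose the keyed answer on item i, else 0."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    return [
        pearson_r([row[i] for row in responses], totals)
        for i in range(n_items)
    ]


# Hypothetical responses of six subjects to four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

correlations = item_total_correlations(responses)
cutoff = 0.50  # assumed value for illustration
retained = [i for i, r in enumerate(correlations) if r >= cutoff]

print([round(r, 2) for r in correlations], retained)
```

In this made-up example the first three items correlate about .8 with the
total and are kept, while the fourth correlates only about .26 and is dropped.
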
Split-half reliabilities for the Meier test range from .70 to .84 in
relatively homogeneous groups of subjects. Scores correlate only negligibly
with traditional measures of intelligence. Only a small amount of research
has been done to determine how well the test predicts available criteria.
It has been shown that the test differentiates art students from non-art
students and different art students in terms of the amount of training that
they have had. A correlation of .46 was found with the grades of 50 art
students (27). Correlations ranging from .40 to .69 were found with ratings
of creative art talent (9, 37).

Graves Test. To remove the test as much as possible from traditional
and contemporary art values, the Graves Design Judgment Test (22, 23)
consists entirely of abstract designs (see Figure 11-10). Each test item
consists of either two or three versions of the same basic design. The
altered version or versions were constructed to violate accepted esthetic
principles. The judgments of art teachers and art students were used to
select the best 90 items from an original list of 150. Split-half
reliability estimates range from .81 to .93. Although the Graves test gives
promise of being a useful measure, only a small amount of empirical
work has been done with the instrument.

FIGURE 11-10. Illustrative items from Graves Design Judgment Test. (Courtesy
of The Psychological Corporation.)

Worksample Tests. A number of tests were designed to test how well
individuals can actually produce graphic art works.
One of these is the Horn Art Aptitude Inventory (25, 26), which
includes the following subtests (see Figure 11-11):

1. Scribble exercise. Making outline drawings of 20 simple objects
2. Doodle exercise. Making abstract compositions out of simple
   geometrical forms
3. Imagery. Working from a given set of lines to a completed composition

Other tests which are largely concerned with the production of art works
are the Knauber Art Ability Test (29) and the Lewerenz Tests in
Fundamental Abilities of Visual Art (32).


Figure 11-11. Sample item from the Imagery Test of the Horn Art Aptitude Inventory.
The subject is shown only the lines in rectangle A from which he is to make a
drawing. Examples of completed drawings are shown in B and C. (From C. Horn, 25;
reproduced by permission of the American Psychological Association.)

The worksample tests must rely on the judgments of graders. Product 
scales are used, in which a particular drawing is compared to a standard 
set. Sample drawings are available for each score level. The grader gives
a score in accordance with the apparent nearness in quality of the sub- 
ject’s drawings to the product scale examples. In spite of the apparent 
subjectivity of the scoring system, moderately high reliabilities are 
reported for tests of the worksample type (25, 26, 28, 29). Current evi- 
dence indicates that the worksample tests predict course grades as well 
as the appreciation tests and do a better job of predicting teachers’ ratings 
of creative ability (2, 26).

Analysis of Graphic Art Tests. The difficulties of defining and measur- 
ing musical aptitude are magnified in the measurement of graphic art 
aptitude. Since the graphic arts are more dependent on fashion, criteria
of accomplishment are weaker; hence the underlying aptitudes are more 
difficult to determine. Unlike the sensory discrimination functions in musi- 
cal aptitude, the current measures of graphic art aptitude appear to 
depend heavily on training. Consequently they are of less use in the early 
guidance of prospective art students, where all tests of art aptitude would 
seem to have their most promising use. 

Current tests are biased toward certain cultural groups. For example, 
it was found (45) that much lower scores on the McAdory test are made 
by Navajo Indian children than by children in New York City, in spite 
of the fact that the Navajo culture has a highly developed art form of 
its own. The available tests appear in most cases to be clever and well 
designed, but the paucity of research which characterizes the testing of 
artistic abilities leaves many questions about how well the tests work in 
practice. Perhaps future factor-analytic studies of graphic art tests will
lead to a better knowledge of the underlying functions and how they can
best be measured.

CHAPTER 12


Achievement Tests 


The achievement test is a natural outgrowth of the teacher’s own class- 
room examination. Now achievement tests are used more broadly to assess 
levels of skill in numerous endeavors in addition to school work—in mili- 
tary, industrial, and particularly in government civil service work. Most 
of the instruments discussed in this chapter are commercial tests which 
can be purchased and used in a wide variety of training programs. 

In terms of the effort spent in test construction and the number of tests 
that are used, achievement tests overshadow all other psychological meas- 
ures. They also have an immediate importance not shared by any other 
type of test. Achievement tests are helpful in assigning school grades, 
in planning school curricula, in remedial training, and in vocational
guidance. Achievement tests are of sufficient importance to merit the work of
hundreds of specialists, and the tests are used in such quantities as to make 
the development of a new instrument a sizable commercial venture. 
Numerous books (see 1, 2, 13, 17) are devoted either wholly or in part to 
the development and use of achievement tests. 

Aptitude and Achievement. Achievement tests are meant to measure 
accomplishment, as distinct from aptitude tests and intelligence tests, 
which are intended to measure the capacity for accomplishment. In some 
situations it is meaningful to distinguish aptitude and accomplishment 
even though it is difficult to measure them separately.

Achievement tests necessarily measure an interaction of more basic apti- 
tudes with the learning that takes place in specific training situations. As 
a group, children who do well in school work must have at least mod- 
erately high aptitude for such training, combined with personality traits 
and interests which lead them to accomplishment. However, it is not 
necessarily true that children who do poor work in school are lacking in 
aptitude. Some of them may be suffering from physical difficulties such as 
poor vision, from disrupted personalities, and from economically impoverished
or unwholesome family environments. Children of this kind sometimes
show marked improvement in school progress after their physical
and personal problems have been dealt with. 

Validity of Achievement Tests. The achievement test is a primary
example of an assessment as described in Chapter 4. The purpose of an 
achievement test is to measure an individual's accomplishment either in a 
specific unit of instruction or in a more general course of training. The 
achievement test might be a good predictor of later school grades or
professional accomplishment, but this is not its primary purpose. Although it
is hoped that achievement at one stage of training will be predictive of 
later success, it would be unfair to judge achievement tests on this basis 
alone. If predictions of behavior were the standard for achievement tests, 
it would be very difficult to decide which later behaviors to predict and 
even more difficult to measure them. For example, what future behavior 
should be predicted by course grades in ancient history? The individual 
might master the course content quite well yet take no further courses 
in history nor enter an occupation for which knowledge of ancient history 
is an obvious prerequisite. 

The achievement test is valid if it faithfully represents the important 
behaviors in a performance situation. This is referred to by some people 
as “circular validity” or “intrinsic validity.” It was said previously that the 
validity of achievement tests and assessments generally is dependent on 
content representativeness. Validity in this sense means that the achieve- 
ment test should concern a broad sample of the information and problems 
that are thought to be important in a course of training. 

Construction of Achievement Tests. The primary steps in constructing 
an achievement test are those which were outlined in Chapter 8 in the 
section entitled “Construction of Assessments.” Preparatory to the con- 
struction of the test, an outline should be made of the skills which the 
particular training course is intended to teach. Because of the funds and 
personnel that are available for constructing most large-scale achievement 
tests, the outline can usually be spelled out in more detail than is possible 
in the individual teacher-made course examination. The outline should be 
carefully compared with the texts, drills, and lecture content of the train- 
ing course. If the achievement test is to be used widely in different 
schools, the outline should be discussed with numerous teachers. Some 
of the major headings in an outline of reading skills are as follows: 

1. Perception of written material. The child must first be able to see 
the individual letters and words before he can read adequately. It is also 
necessary for the child to have a proper balance of eye muscles and to 
coordinate eye movements in reading. 

2. Vocabulary. The next essential is for the child to have a knowledge 
of the words that appear in written material at his level of schooling. 

3. Reading speed. Although, within limits, reading speed is not as
important as the understanding of what is read, the very slow reader is
handicapped in his school work.

4. Retention. Before any further understanding is possible, the child
must be able to remember some of the factual content of what he reads.

5. Critical reaction. At the highest level of reading accomplishment, the
child should be able to synthesize what he reads with past knowledge,
analyze the implications of what is read, and react critically to the
material.

6. Auxiliary skills. These include a familiarity with tables of contents, 
the meaning of footnotes and references, acquaintance with library filing 
systems, and a familiarity with dictionaries and reference sources. 

Each major heading in the outline should be filled out in much more 
detail than the example above (see 5, chap. 7; 15, pp. 272-273 for more 
complete outlines of reading skills). 

After the outline is completed, the next step is to compose test items 
for each part of the outline. Considerable material has been written on
the composition of achievement test items (see 2; 15, chap. 4; 17). Some 
of the principal points to consider were outlined in Chapter 8. After a 
large collection of items (an item pool) has been made, a number of 
judges should be used to estimate the clarity, relevance, importance, and 
representativeness of the items for the particular achievement area. In 
Chapter 8, we described some methods of item analysis which are useful 
in selecting those items from an item pool which will produce a better 
achievement test. 
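
Two statistics commonly used in item analysis are the proportion of examinees passing an item (its difficulty) and the correlation of the item with the total score on the remaining items (its discrimination). The Python sketch below computes both. It is offered only as an illustration: the function names and response data are hypothetical, and the specific methods described in Chapter 8 may differ.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns None if either list has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def item_statistics(responses):
    """responses holds one list of 0/1 item scores per examinee.

    Returns one (difficulty, discrimination) pair per item.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Total score on the other items, so the item is not
        # correlated with itself.
        rest = [t - i for t, i in zip(totals, item)]
        difficulty = sum(item) / len(item)
        stats.append((difficulty, pearson(item, rest)))
    return stats

# Hypothetical item pool of three items tried out on four examinees
pool = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

for number, (p, r) in enumerate(item_statistics(pool), start=1):
    print(number, p, r)
```

Items passed by nearly everyone or nearly no one, and items with low or negative discrimination, would be the first candidates for removal from the pool, subject to the judges' ratings of relevance and importance.
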

The Content of Achievement Tests. Because nearly all standard 
achievement tests are “objective” examinations which employ some variant 
of the multiple choice type of item, they have been criticized as measur- 
ing only the memory for simple factual details. Most teachers agree that 
the memory for details is not the important thing to be gained from a unit 
of instruction. Here in this book, for example, the reader will soon forget 
the names of many of the tests, the specific formulas, and the numerous 
statistical results which are cited. But it is hoped that the total impact of
the book will be such as to provide the reader with a set of principles, a 
logical scheme, and a way of thinking about human measurement which 
will not fade with time. 

The content of multiple choice tests was discussed in Chapter 8, the 
viewpoint expressed there being that objective tests often do focus on 
simple factual details but that this need not be the case. There is con- 
siderable empirical evidence and practical experience to show that care- 
fully constructed multiple choice tests can adequately measure the under- 
standing of complex principles, the ability to draw conclusions, and the 
ability to make inferences from one set of facts to another. The standard- 
ized achievement tests often embody more important content than the 
teacher's own multiple choice examinations. This is because there are 
more funds, facilities, and expert services available for the construction
of standardized achievement tests. Although the standardized achievement
tests successfully use multiple choice items, teachers should not
assume that all examinations should be of that kind. The poorly con- 
structed multiple choice test is often less effective than an essay exam- 
ination. Specific examples will be shown later in the chapter of how 
achievement tests can adequately measure the high-level understanding 
of different subject matters. 

Achievement Test Scores. The most popular scoring procedure is the 
straight multiple choice form in which the question, or stem, is supplied 
with anywhere from two to more than a dozen alternatives, one of which 
is designated the “correct” response. A number of variants of the multiple 
choice form are useful in special circumstances. One of these is the match- 
ing type of item, in which each of a number of names, for example, is 
matched with one of a number of accomplishments or events. A typical 
item is as follows: 


Columbus     a. invented the compass
Balboa       b. sailed around the world
Magellan     c. discovered Australia
             d. discovered America
             e. invented the steamship
             f. discovered the Pacific Ocean


The matching question usually has more names or terms than the three 
in the example above. An important rule is that there should be consid- 
erably more events or definitions than persons or terms. Otherwise, “guess- 
ing” plays an important part in the responses. The matching question is 
most useful when there are many names, definitions, and specific facts to 
be tested. In such cases, it provides a handy way of condensing a number 
of separate questions into one large item. 
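
The effect of adding extra events can be checked by enumeration. The Python sketch below lists every way a subject could assign distinct events to the names purely by chance and reports the expected number of correct matches and the chance of a perfect score. It assumes, as a simplification not stated in the text, that the guesser uses each event at most once; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def guessing_outcomes(n_names, n_events):
    """Expected correct matches and chance of a perfect score when
    distinct events are assigned to the names completely at random."""
    key = tuple(range(n_names))  # event i is the keyed answer for name i
    assignments = list(permutations(range(n_events), n_names))
    hits = [sum(a == k for a, k in zip(assignment, key))
            for assignment in assignments]
    expected = Fraction(sum(hits), len(assignments))
    p_perfect = Fraction(hits.count(n_names), len(assignments))
    return expected, p_perfect

print(guessing_outcomes(3, 3))  # three names, three events
print(guessing_outcomes(3, 6))  # three names, six events, as in the example
```

With three names and only three events, blind guessing yields one correct match on the average and a perfect score one time in six. With six events, as in the example above, the average drops to one-half and a perfect score occurs only once in 120 attempts.
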

Another variant of the multiple choice type of item is to require the 
subject to rank the alternatives from most to least correct. In many tests, 
the “correct” alternative is not completely correct and the “incorrect” 
alternatives are not equally incorrect. It is reasonable then to think that 
more information could be obtained from each question by having all of 
the alternatives ranked. Although there is something to say for this pro- 
cedure, it offers a number of drawbacks. Principally, it is difficult to con- 
struct alternatives for an item in such a manner that they can be logically 
ordered from most to least correct. In many questions, the alternatives are
either true or false, and there is little basis for choosing among the false 
alternatives. Also, items of this kind take much more time to administer, 
and, consequently, the range of items that can be used is severely 
restricted if, say, only an hour is available for the test. In addition, such 
items are difficult and time consuming to score. 

Another variant of the multiple choice item is to have the subject mark
all of the alternatives which he is sure are incorrect. Like the rank-ordering
of alternatives, this should logically provide more information
about the subject’s knowledge than the straight multiple choice item. The 
subject's score consists of the number of incorrect alternatives marked as 
incorrect. The major problem with this type of item is that it introduces 
personality factors to an unknown extent. Some people are more willing 
to take chances than others and consequently to gamble on more of the 
incorrect alternatives. Such predispositions will influence test scores, mak- 
ing them partly measures of personality characteristics, which should 
logically not enter into achievement tests. 
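
The influence of risk taking on this kind of score can be made concrete. In the Python sketch below, two hypothetical subjects have identical knowledge of a four-alternative item: each is sure that alternative b is wrong. One marks only b; the other also gambles on one more alternative chosen at random. The scoring rule follows the description above (the number of incorrect alternatives marked as incorrect); the function names are illustrative, and no penalty for marking the correct alternative is assumed because none is described.

```python
from fractions import Fraction
from itertools import combinations

def elimination_score(marked, correct):
    """Number of incorrect alternatives the subject marked as incorrect."""
    return sum(1 for alternative in marked if alternative != correct)

def expected_score(alternatives, correct, known_wrong, extra_guesses):
    """Average score when the subject marks everything in known_wrong
    plus extra_guesses further alternatives picked at random."""
    rest = [a for a in alternatives if a not in known_wrong]
    gambles = list(combinations(rest, extra_guesses))
    total = sum(elimination_score(set(known_wrong) | set(g), correct)
                for g in gambles)
    return Fraction(total, len(gambles))

cautious = expected_score("abcd", "a", {"b"}, extra_guesses=0)
gambler = expected_score("abcd", "a", {"b"}, extra_guesses=1)
print(cautious, gambler)
```

Under these assumptions the cautious subject earns 1 point and the gambler averages 5/3 points, although the two know exactly the same amount about the item.
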

Although it is important to continue exploring new types of test items, 
the weight of the argument is in favor of the straight multiple choice 
form for most purposes. In most cases, it allows the presentation of more 
items in a shorter time, is the easiest type of item for subjects to under- 
stand, and is easier to construct and score. It is usually found that more 
complex scoring systems give results which correlate very highly with the 
straight multiple choice form. 

Achievement Test Norms. Because the purpose of most achievement 
tests is to compare either a pupil or a unit of instruction with more gen- 
eral educational standards, it is important to have truly representative 
norms. The norms available for many of the standard achievement tests 
are generally superior to those used with other kinds of tests. In many 
ways, the task of obtaining norms is simpler with achievement tests than 
with predictor tests. The normative population to be used with predictor 
tests is sometimes difficult to define. The normative populations for 
achievement tests are more easily specified, e.g., all of the sixth-grade 
students in the United States. Also, because the normative population is 
contained in specific training programs, subjects are more available for 
testing, and procedures of sampling can be more effectively applied. 

The question often arises as to whether national norms or more local 
norms should be employed with achievement tests. This depends on how 
the tests are used. If, for example, achievement test scores are being used 
to select students for advanced classes in a particular city, it would be 
wiser to use local norms. If the purpose is to compare the amounts and 
kinds of training in different schools across the country, national norms 
should be used.
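
How much the choice of norm group matters can be seen by converting one raw score to a percentile rank against two different groups. The Python sketch below uses one common convention for the percentile rank (the percentage of the norm group scoring below the score, plus half the percentage scoring exactly at it). Both norm groups are invented and far smaller than any real normative sample, and the function name is hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percent of the norm group below the score, counting half of
    those who earned exactly the same score."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100 * (below + 0.5 * equal) / len(norm_scores)

# Invented norm groups: a national sample and a high-scoring city
national_norms = [50, 60, 70, 80, 90]
local_norms = [70, 80, 85, 90, 95]

raw_score = 70
print(percentile_rank(raw_score, national_norms))
print(percentile_rank(raw_score, local_norms))
```

The same raw score of 70 falls at the 50th percentile of the invented national group but at only the 10th percentile of the invented local group, which is why selection for advanced classes within one city is better served by local norms.
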

Classifications of Achievement Tests. There are a number of different 
ways of classifying achievement tests in terms of their uses. A primary 
distinction is made between tests designed to measure one subject only, 
such as the knowledge of geography, and batteries designed to cover a
broad range of subject matters, which might include a geography subtest
along with subtests for literature, history, mathematics, and other subjects.
They will be referred to respectively as unisubject and multisubject
achievement tests. The distinction is a matter of degree because some of 
the unisubject tests, particularly those for language and mathematical 
skills, divide the subject matter into a number of separately tested parts. 

A distinction is often made between survey achievement tests and diag- 
nostic tests. The survey tests are meant to measure how much a person 
knows. The diagnostic tests are meant to tell why a person is or is not 
achieving adequately. Survey tests are the typical kind used to measure 
the mastery of different subject matters. Diagnostic tests are used pri- 
marily in guidance and remedial work to offer clues as to why children 
are doing poorly in certain subject matters. The distinction between sur- 
vey and diagnostic tests is also a matter of degree. There is an unfortunate 
tendency for people to refer to all unisubject tests that measure a number 
of components as diagnostic tests. Examples of more truly diagnostic tests 
will be discussed in the coming sections. 


USES FOR ACHIEVEMENT TESTS 


Assignment of Grades. Teacher-made tests are used more often than 
standardized achievement tests for the assignment of course grades in 
elementary school, high school, and college. The advantage for the 
teacher-made test is that it is tailored to the particular unit of instruction. 
It is also the teacher’s responsibility to decide what grades should be 
given. The use of achievement tests for the assignment of grades would, 
in many cases, be unfair because of the differences in content of courses 
bearing the same name. In order to keep a semblance of order in the 
recording of school progress, it is essential that courses with similar names 
have at least a core of knowledge in common. For example, if an indi- 
vidual receives a passing grade in plane geometry, it is expected that he 
has a familiarity with certain facts and concepts. However, it is reasonable 
to expect the instructor to emphasize certain topics more than others and 
to include material that another instructor might not discuss. Therefore, 
the standardized achievement test used for the assignment of school 
grades might penalize particular students to some extent. 

Standardized achievement tests are more often used to assign course 
grades in special school programs, in government, military, and industrial 
training. For example, achievement tests are frequently used in assigning 
grades to participants in technical training schools in the Armed Forces. 
It is sensible to use achievement tests to assign grades in special school 
programs for a number of reasons. The course content is usually more 
standardized than that in, say, most colleges. For example, in a course 
on the “maintenance of small arms” there are rather definite rules to be 
taught. It is expected that all instructors will teach very much the same 
information, and it is not uncommon for all instructors to use a standard
teaching syllabus to insure uniformity of instruction. The course content 
of many special training programs remains relatively fixed over moderate 
periods of time. Therefore, a standard achievement test may be usable for 
the assignment of course grades over a period of several years or more 
with only minor revisions in test content. In many special training pro- 
grams, the instructor is not sufficiently trained in pedagogic techniques 
to assign course grades effectively. Although the instructor may be a very 
good salesman, infantry sergeant, or machinist and also may be a good 
teacher, he may have little knowledge of testing procedures and the grading
of students. Consequently, it is more meaningful to assign grades on
the basis of achievement test scores. 

Classification of Individuals. Achievement tests are used more fre- 
quently to test the individual’s mastery of a whole range of subject matter 
than to grade his performance in an individual unit of instruction. Multi- 
subject achievement tests are often administered to graduating grammar 
school students preparatory to entering high school. Although the student 
may have good marks in individual courses, it is still uncertain what his 
total educational experience has been. He may have forgotten large 
amounts of the subject matter, or the particular school may have under- 
emphasized certain subject matters which are necessary in high school. 
Students who do poorly on parts of the achievement test, in grammar, for 
example, can then be classified into a group needing special preparatory 
instruction during the freshman year of high school. 

Achievement tests are often used to classify students into special courses 
or curricula within grammar school, high school, or college. There is a 
growing trend toward tailoring instruction to the individual. A widely 
used procedure is to assign students to one of several sections on the basis 
of over-all progress shown either by previous course grades or by achieve- 
ment tests. An even more logical method of classification, if resources 
permit, is to gear the instruction not only to the student’s over-all level of 
achievement but to his own strong and weak points as well. One of the
multisubject achievement tests is useful for that purpose. 

Counseling and Remedial Training. Achievement tests are useful to the 
school psychologist and the clinical psychologist in understanding the 
school difficulties of particular children. The practice in some schools of 
promoting children from one grade to the next even if they do very poorly
often permits the child with little aptitude for school work to get far over 
his head in difficult subject matters. This will frustrate him as well as 
lower the morale of teachers and students. Achievement tests can be used 
to indicate the generally low level of such a student's school progress, and 
this in turn can suggest placement of the student in a training program
more suited to his ability.

The diagnostic achievement tests are designed primarily to help in the
counseling and remedial training of students who show a difficulty in 
mastering certain school topics, and particularly to help understand the 
underlying reasons for reading difficulties. If it is found, for example, that 
the child understands what he reads but that he reads very slowly, specific 
remedial training can be undertaken to increase reading speed. 

The cause of truancy or poor conduct in the classroom is often indicated 
by achievement tests. Such disruptive behavior is sometimes related to a 
low level of achievement. The student takes out his frustration on the 
teacher and his classmates. A very high score on an achievement test can 
also be indicative of a student's difficulties in conduct. The very bright 
student is often bored by the relatively low level of the curriculum and 
turns to misbehavior or truancy. 

Vocational Guidance. Achievement tests in combination with course 
grades are useful in helping the student to choose future programs of 
school work or vocational training. They are particularly useful in helping 
students to decide whether or not to go on from high school to college, 
and if so, which kind of college training to undertake. If, for example, 
the student is considering premedical training in college, he should show 
achievement in biology, science, and have a generally high level of 
accomplishment. There are exceptions of course—the student who does 
poorly in high school but goes on to high accomplishment in college. But 
the achievement test scores at least indicate the amount of improvement 
the student will need to show in order to reach minimum standards for 
subsequent schooling. 

Two students with the same over-all level of accomplishment at the end 
of high school may have different strong and weak points in particular 
subjects. The measurement of such differential achievement with multi- 
subject tests provides a basis for deciding about careers and future 
training. 

Achievement tests used in vocational guidance must be evaluated as 
predictors, not as assessments. Test scores are used essentially to forecast 
how well an individual is likely to perform later. Using achievement tests 
in vocational guidance is relatively safe when the question concerns going 
on from high school to college or from college to graduate work. For 
example, the individual who performs poorly in the language arts in col- 
lege is not likely to meet with success in the graduate school specialties 
of journalism and English. It is less safe to use achievement test scores to
advise students to go into vocations such as sales work, carpentry, and 
forestry. If the predictive validity of the achievement test is not known for
the vocations in question, the vocational guidance is as good or as bad as
the counselor's judgment.


Measuring the Effectiveness of Instructional Techniques. The primary 
reason that many teachers are unfriendly to standardized achievement 
tests is that they are viewed as tests of teaching ability. For example, if it 
is found that the students in a particular school do poorly on the biology 
section of an achievement test in comparison to the scores made by stu- 
dents in neighboring schools, it suggests that the biology teacher is not 
doing a good job. There are three major variables contributing to the 
grades that students make on an achievement test: the initial aptitude of 
the students, the effectiveness of the training program, and the nature of 
the test. These three factors should be carefully considered before blame 
or praise is placed on the teacher alone. 

The teacher cannot bring students to high achievement if they are lack- 
ing in initial aptitude. Large differences are found in the achievement test 
scores of children in different schools, varying with geographic region, 
ethnic background, and socioeconomic status.

It is unfair to use achievement test results to measure the effectiveness 
of teachers if the test itself is poorly constructed. If the test is poorly 
standardized or if the content is generally trivial, it is not the teacher’s 
fault that students do poorly. 

If the achievement test is well constructed and if account is taken of 
the over-all level of achievement for a group, the teacher cannot entirely 
escape the blame for a low level of achievement in a particular subject 
matter. The low level of achievement may be due in part to crowded 
classrooms, lack of proper equipment, and to a poorly run school, but it 
cannot be denied that the teacher’s individual effectiveness is also promi- 
nently involved. 

The practice of using achievement tests to measure the effectiveness of 
instruction is disturbing to some teachers. They argue that the achieve- 
ment test puts a “strait jacket” on instructional methods, forcing the 
teacher to tailor the instruction to the kinds of material in the achieve- 
ment test. This may discourage the teacher from experimenting with new 
methods of instruction and with new aspects of the subject matter. It is 
not unheard of for a teacher to coach students on materials so similar to 
those in the achievement test that the effectiveness of the measuring 
instrument is ruined. 

Aside from the use of achievement tests to measure the effectiveness of 
teachers, achievement tests have a real purpose in measuring the effective- 
ness of different kinds of instruction. For example, the question might 
arise as to whether or not a film series would help in the instruction of 
geography. If students who have the film series do better on an achievement
test than those who participated in a straight lecture course, this
offers one type of evidence in favor of the films. Achievement tests are 
helpful in determining the effectiveness of many such variants of teaching
methods. 
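
A comparison such as the film series against the lecture course comes down to comparing the achievement test scores of two groups. The Python sketch below computes the difference between the group means and expresses it in units of the pooled standard deviation. The scores are invented, the function name is hypothetical, and the sketch ignores the question of whether the two groups were comparable in aptitude to begin with, which would have to be settled before the difference could be credited to the films.

```python
from math import sqrt
from statistics import mean, variance

def compare_groups(group_a, group_b):
    """Return the difference between group means and that difference
    divided by the pooled standard deviation of the two groups."""
    n_a, n_b = len(group_a), len(group_b)
    difference = mean(group_a) - mean(group_b)
    # variance() is the sample variance (divisor n - 1)
    pooled_variance = ((n_a - 1) * variance(group_a) +
                       (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return difference, difference / sqrt(pooled_variance)

# Invented geography achievement scores
film_scores = [78, 82, 85, 90, 75]
lecture_scores = [70, 74, 80, 72, 69]

difference, standardized = compare_groups(film_scores, lecture_scores)
print(difference, round(standardized, 2))
```

Here the film group averages 9 points higher, a difference of roughly 1.7 pooled standard deviations. With groups this small, a formal significance test would be needed before drawing any conclusion.
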

Achievement tests are useful in the organization of school curricula.
Decisions have to be made about the level at which students can master 
particular subject matters and the best sequence for presenting different 
courses. It might be desirable to offer a course in French earlier in train- 
ing than had previously been the practice. If the students who take the 
course at an earlier age do about as well on an achievement test or not 
much worse than the students who take the course later, this suggests that 
the course could be given earlier in training. 

Selection of Individuals. Although the primary purpose of achievement 
tests is to assess current performance, they are also useful as predictors of 
future behavior. Achievement tests can be used to select college students, 
office workers, insurance salesmen, and many others. When achievement 
tests are used in this way, they stand on new logical grounds. They are 
then predictors and are as good or as bad as their correlations with measures
of future school or vocational success. The achievement test that is
a well-constructed instrument for the assessment of performance in a 
training program may be no good at all as a predictor of future success in 
particular vocations. 

Achievement tests are best for predicting success in vocations which 
entail activities similar to those in the training programs for which the 
tests were originally designed. For example, an achievement test of mathe- 
matical skills developed for high school students will more likely be bet- 
ter as a predictor of success in college mathematics courses than it will be 
as a predictor of success in politics. Because most achievement tests 
concern school subject matter, they are most useful for predicting success 
in future educational efforts. Achievement test scores of graduating high 
school students are generally as effective if not more effective than intel- 
ligence tests for predicting college grades. When the two kinds of tests 
are combined with high school grades, good predictive efficiency is usually
obtained.
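
The gain from combining predictors can be illustrated with the standard formula for the multiple correlation of a criterion with two predictors, which needs only the three pairwise correlations. The Python sketch below applies it. The correlations are invented for illustration, the function name is hypothetical, and the sketch is limited to two predictors, whereas the combination described above involves three.

```python
from math import sqrt

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion y with predictors 1 and 2.

    r_y1 and r_y2 are the correlations of each predictor with the
    criterion; r_12 is the correlation between the two predictors.
    """
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)

# Invented values: each predictor correlates .50 with college grades
print(round(multiple_r(0.50, 0.50, 0.50), 3))  # predictors overlap
print(round(multiple_r(0.50, 0.50, 0.00), 3))  # predictors independent
```

When the two invented predictors correlate .50 with each other, the combined correlation rises only from .50 to about .58; if they were uncorrelated it would reach about .71. A second predictor adds most when it overlaps least with the first.
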

Achievement tests are used prominently in the selection of persons for 
specific jobs and professions, particularly in the civil service program. A 
typical predictor battery is that for stenographers, in which the tests con- 
sist of achievement in typing, shorthand, and other skills pertinent to 
office work. Achievement tests have been used in this way to select indi- 
viduals for a wide variety of jobs. 


SURVEY ACHIEVEMENT TESTS 


Standard achievement tests are available for many different subject
matters, and there are usually competing forms from which to choose.
The following sections will consider several examples in each major area
of achievement testing. The person who is interested in acquiring achievement
tests for particular purposes should consult the reference sources
cited in Chapter 18 and should obtain catalogues from the major test 
publishers listed in Appendix 1.

The multisubject survey tests are generally favored over the separate 
unisubject tests. That is, rather than choose separately constructed tests 
for geography, history, biology, and the other subject matters, it is gen- 
erally better to use a multisubject battery in which the subtests have all 
been constructed and standardized alike. Some ambiguity in comparing 
students on different unisubject tests often comes from the different test 
construction methods used by different commercial firms. An even greater 
advantage of the multisubject battery is that the norms are usually 
obtained for all tests on the same subjects. 

There are several exceptions to the preference for multisubject batteries 
over unisubject tests. Several subject matters are sufficiently complicated 
within themselves to require extensive individual treatment. This is so of 
mathematics and reading. Each of these is a unisubject domain in name 
only. There are various functions that need to be tested in each. In order
to test the extensive skills within each of these general subject-matter 
areas, it is necessary to use such a variety of test materials as to make it 
impractical in most cases to administer them along with tests for geog- 
raphy, biology, and other subject matters. Another instance in which the 
unisubject test is preferred over the multisubject batteries is in the testing 
of special skills such as handwriting and musical accomplishment. The 
testing of these skills usually requires novel testing methods and scoring 
procedures which are sufficiently different from the typical multiple 
choice test to require separate treatment. 

Survey Reading Tests. Reading is by far the most important skill to 
be tested in school children. The first several grades are devoted largely 
to teaching the child how to read. From that point on, school work is 
almost synonymous with the comprehension of written material. If a child 
is having difficulty in reading and understanding what is read, it is impor- 
tant to find it out as soon as possible so that remedial instruction can be 
undertaken. If a school had its choice of only several tests that could be 
used in a testing program, it would be wise to use, along with one of the 
better intelligence tests, a comprehensive test of reading skills. 

The distinction between diagnostic and survey tests is particularly 
tenuous in the measurement of reading skills. Almost all of the reading 
tests measure a number of related functions and, as such, offer some help 
in diagnosing reading difficulties. The reading tests cited in the coming 
section entitled “diagnostic tests” are placed there rather than here
because they are more specifically designed to offer clues about under-
lying reasons for reading difficulties.

A typical survey test of reading is the Gates Basic Reading Tests (3).
Separate sections are used to measure ability in: (a) reading to appreci-
ate general significance; (b) reading to predict outcomes of given events;
(c) reading to understand precise directions; and (d) reading to note
details. Illustrative items are shown in Figure 12-1. Age and grade norms
are available for grades 3½ through 8. All of the parts are timed, the
amount of time varying with the grade up to 40 minutes in all. Subtest


Figure 12-1. Sample items from the Gates Basic Reading Tests, grades 3½ through 8,
1957 edition. (Reproduced by permission of the Bureau of Publications, Teachers
College, Columbia University.)


Type GS. Reading to Appreciate General Significance 

Smart as horses are, they do not always know what is good for them. They some- 
times want to gallop at top speed, but a good rider will never let them do it. A 
horse running at his top speed is out of control, just as a high-powered car would 
be. Unless a horse has been trained as a race horse, top-speed running puts great 
strain on his delicate legs. 

Draw a line under the word that tells what kind of running is bad for a horse. 
slow top-speed easy controlled moderate 


Type ND. Reading to Note Details

Next morning she awoke and found herself in a beautiful room. The walls were
covered with silken curtains. There were two mirrors made of pure silver. The bed
was made of ivory. The coverings were made of silk and velvet. By her bed lay a
dress and a pair of slippers. The dress was made of silk. The slippers were covered
with diamonds.

Where did the girl find herself?
barn room garden store
What were the mirrors made of?
silver gold pearl silk
What were on the slippers?
rubies pearls opals diamonds


reliabilities range from .80 to .96 for different grades. Although separate 
scores can be obtained from the four subtests, it is better to average them 
as an over-all measure of reading ability. The correlations among the four 
subtests are too high to permit the use of profile difference scores. Except 
for large differences in subtest scores, score differences would be largely 
due to chance. 
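
The point about profile difference scores can be illustrated numerically. The sketch below uses the standard classical-test-theory expression for the reliability of a difference between two scores of equal variance, r_dd = ((r_xx + r_yy)/2 - r_xy) / (1 - r_xy). The function name and the particular reliabilities and intercorrelations are assumptions chosen for illustration; they are not values reported for the Gates tests.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference between two equally variable scores.

    r_xx, r_yy: reliabilities of the two subtests.
    r_xy: correlation between the two subtests.
    """
    if not -1.0 < r_xy < 1.0:
        raise ValueError("r_xy must lie strictly between -1 and 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical case 1: two reliable subtests that correlate highly.
high_overlap = difference_reliability(0.90, 0.90, 0.80)

# Hypothetical case 2: the same reliabilities, but a lower intercorrelation.
low_overlap = difference_reliability(0.90, 0.90, 0.40)

print(f"difference reliability, r_xy = .80: {high_overlap:.2f}")
print(f"difference reliability, r_xy = .40: {low_overlap:.2f}")
```

Under these assumed values, subtests that are each quite reliable still yield a difference score of only moderate reliability when they correlate .80 with one another, which is why averaging the subtests is the safer course.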

Two other widely used survey reading tests are the Nelson-Denny 
Reading Test: Vocabulary and Paragraph (12); and the Cooperative 
English Test (19). Many of the multisubject batteries also include read-
ing subtests.

Survey Batteries for Elementary School. The multisubject batteries find
their widest use in the testing of elementary school children. Achieve-
ment tests have a special importance there because so much of the child's
education lies ahead of him, and it is important to learn his strong and 
weak points as early as possible. Because of the greater uniformity of 
subject matter in elementary school, it is easier to compose a comprehen- 
sive battery of achievement tests for this level than for higher levels of 
education. 


Figure 12-2. Sample items from Metropolitan Achievement Tests, Advanced Battery.
(Reproduced by permission of World Book Company.)


Vocabulary
She replies means she— 1. complains 2. depends 3. fills 4. answers

Arithmetic fundamentals
What per cent of 24 is 9?

Arithmetic problems
Ned bought one-half dozen roses for $1.68. At that price, what did one rose
cost?

English: Part II. Punctuation and Capitalization
Put in the capital letters and commas, periods, and other punctuation marks that
have been left out.
Does Carl want the candy or the fruit he prefers candy but he likes fruit too.

Literature
Friday was a faithful servant of—
1. Tarzan 2. Huckleberry Finn 3. Robinson Crusoe 4. Robin Hood

Social Studies: History and Civics
Western Europe became interested in exploration because—
1. the feudal system disappeared. 2. many schools were opened.
3. of a desire for new trade routes. 4. the Church favored it.

Science
In the lungs, the blood gains a supply of—
1. argon 2. nitrogen 3. carbon dioxide 4. oxygen


A typical series of achievement tests for the elementary level is the
Metropolitan Achievement Tests (9). The series includes four separate
batteries ranging from the Primary I Battery for the first grade to the
Advanced Battery for grades 7 and 8. The Primary I Battery includes
reading tests and one numerical test. Each of the three other batteries
contains ten subtests: reading, vocabulary, arithmetic fundamentals, arith-
metic problems, history, geography, English, literature, science, and spell-
ing. The forms are intended to be power tests (see Figure 12-2 for illus-
trative items). A battery can be given in from one to four hours. Either
four or five equivalent forms are available for each battery. Split-half
reliabilities for the subtests range from .80 to .97 for different batteries.
Two other widely used multisubject batteries are the Stanford Achieve-
ment Tests (10) and the California Achievement Tests (16).

High School Achievement Tests. Comprehensive achievement test bat-
teries for high school are more difficult to construct than are those for ele-
mentary school. Whereas all students tend to study many of the same
topics in elementary school, they often diverge widely in their elective
topics in high school. Also, the curriculum spreads out to numerous topics
such as mental hygiene and civics, which are usually not offered in ele-
mentary school. Consequently, the available achievement test batteries
tend to capitalize on the core subject matters that underlie high school
training, particularly literature, mathematics, general science, and some
aspects of social science. The scores that a student makes on high school
achievement batteries should be considered in the light of the special
course of study which he has undertaken and the grades made in par-
ticular units of instruction.

A typical high school achievement battery is the Cooperative General
Achievement Tests (20). The three subtests concern broad knowledge of
the social sciences, mathematics, and natural science. Each subtest is
divided into two parts, the first part being concerned with a knowledge
of terms and concepts and the second part with comprehension and inter-
pretation (see Figure 12-3 for illustrative items). Other high school
achievement batteries are the Iowa Tests of Educational Development
(11), The Essential High School Content Battery (8), and the California
Achievement Tests (16).


Figure 12-3. Sample items from the Cooperative General Achievement Tests. (Re-
produced by permission of the Educational Testing Service.)


Social Studies: Terms and Concepts

Most children in the United States attend schools that are supported primarily by—
1. contributions from churches.
2. the federal government.
3. private endowments.
4. local and state taxation.
5. tuition paid by pupils.

A reciprocal trade tariff is one in which—
1. both political parties compromise.
2. the rates are low.
3. exclusion of foreign goods is sought.
4. all sections of the nation benefit nearly equally.
5. two nations modify their rates on each other’s goods.

A typical daily newspaper derives the largest part of its income from—
1. subscriptions and newspaper sales.
2. advertising by commercial establishments.
3. payment for space by news-gathering agencies.
4. want ads.
5. donations by political parties, labor unions, etc.


Figure 12-4. Sample items* from the Watson-Glaser Critical Thinking Appraisal.
(Reproduced by permission of World Book Company.)


TEST 2. ASSUMPTION MADE

EXAMPLE. STATEMENT: “We need to save time in getting
there, so we'd better go by plane.”

PROPOSED ASSUMPTIONS:

1. Going by plane will take less time than going by some
other means of transportation. (It is assumed in the
statement that greater speed of a plane over other means
of transportation will enable the group to get to their des-
tination in less time.)

2. It is possible to make plane connections to our destina-
tion. (This is necessarily assumed in the statement, since,
in order to save time by plane, it must be possible to go
by plane.)

3. Travel by plane is more convenient than travel by train.
(This assumption is not made in the statement—the state-
ment has to do with saving time, and says nothing about
convenience or about any other specific mode of travel.)


TEST 3. CONCLUSION

EXAMPLE. Some holidays are rainy. All rainy days are bor-
ing. Therefore—

1. No clear days are boring. (The conclusion does not fol-
low, as you cannot tell from these statements whether or
not clear days are boring and some may be.)

2. Some holidays are boring. (The conclusion necessarily
follows from the statements, since, according to them, the
rainy holidays must be boring.)

3. Some holidays are not boring. (The conclusion does not
follow from the statements even though you may know
that some holidays are very pleasant.)


TEST 5. ARGUMENT: STRONG, WEAK

EXAMPLE. Should all young men go to college?

1. Yes; college provides an opportunity for them to learn
school songs and cheers. (This would be a silly reason
for spending years of one’s life in college.)

2. No; a large per cent of young men do not have enough
ability or interest to derive any benefit from college train-
ing. (If this is true, as the directions require us to assume,
it is a weighty argument against all young men going to
college.)

3. No; excessive studying permanently warps an individual's
personality. (This argument, although of great general
importance when accepted as true, is not directly related
to the question, because attendance at college does not
necessarily require excessive studying.)


* The explanations in parentheses do not appear in the items of the test proper.


Achievement Tests for Higher Education. Considerable ingenuity in
test construction is required to measure the products of college training,
graduate study, and professional work. Students can take such different
sets of courses that there is little overlap in training. Therefore, achieve-
ment tests for higher education are best aimed at particular lines of spe-
cialization, such as engineering. It is more hazardous to employ survey
batteries that are meant to compare all students in their general college
progress. A systematic battery for the diverse courses that students can
take would be a very large-scale instrument. Instead of having only one
subtest for “science” it would be necessary to have separate subtests for
physics, chemistry, and biology. Other related subjects which can be
lumped together at lower levels of training should be tested separately
at the higher education level.

One of the most widely used achievement test batteries for the college 
level is the Graduate Record Examination (Educational Testing Service). 
The battery includes tests of verbal and mathematical ability and subject- 
matter tests for physics, chemistry, biology, social studies, literature, and 
fine arts. The separate tests are sufficiently long and well standardized to 
provide reliable scores. The battery is usually administered at the end of 
college training. It has proved useful in guidance of college students, 
selection of graduate students, and research on college curricula. 

In addition to their use for the measurement of progress in college pro-
grams, achievement tests can be used to test some of the over-all results
of higher education. Instruments of this kind are usually referred to as
tests of critical thinking. A typical example is the Watson-Glaser Critical
Thinking Appraisal (18). The five parts of the test are concerned with
the ability to recognize assumptions, make inferences, deduce conse-
quences, interpret statements, and evaluate arguments (see Figure 12-4
for illustrative items). All of the materials are presented as multiple
choice items. The test illustrates the measurement of complex thought
processes with objective items.


DIAGNOSTIC ACHIEVEMENT TESTS 


As was said previously, whether or not an achievement test is “diagnos-
tic” is a matter of degree. Some achievement tests make particular efforts
to measure the constituent skills involved in a subject-matter area, and
these more rightly earn the name of diagnostic tests.

Diagnostic Reading Tests. Diagnostic tests have assumed their most
important place in the measurement of reading skills. This is because
reading is a very important part of school achievement and because the
attributes which underlie reading skill are complex. A typical diagnostic
reading measure is the Iowa Silent Reading Tests (6, 7). The series
includes an elementary battery for grades 4 to 8 and an advanced battery
for high school and college. The advanced battery includes the following
subtests:

Test 1. Rate of reading and comprehension of prose
Test 2. Directed reading of prose to answer particular factual questions
Test 3. Poetry comprehension
Test 4. Vocabulary in different content areas
Test 5. Sentence meaning
Test 6. Paragraph comprehension
Test 7. Location of information using an index

Another approach to the measurement of reading skills is through the
use of standard oral passages. The child is asked to read aloud a series of
graded passages. The Standardized Oral Reading Paragraphs is a form
which has been widely used (4). While the pupil reads each passage, the
examiner uses a standard code to indicate the number and kinds of mis-
takes which are made. The examiner looks for mistakes like the follow-
ing:

1. Mispronounced words
2. Mispronounced vowels
3. Repetitions: reading the same word or phrase over before going on
4. Omissions: leaving out a word or phrase
5. Substitutions: saying a different word from that in the passage
6. Insertions: adding words
7. Reversals, such as saying “no” when the word is “on”

In addition to the noting of various kinds of mistakes, the examiner has
an opportunity to observe the child while reading. This often provides
some clues about reading difficulties, such as stuttering, extreme shyness,
and visual or auditory defects.

Diagnostic Tests of Mathematical Skills. Mathematics is second to read- 
ing as an area in which diagnostic tests are most needed. A widely used 
battery is the Compass Diagnostic Tests in Arithmetic (14), comprising 
a series of group tests suitable for grades 2 through 8. Addition, subtrac- 
tion, multiplication, and division are broken into a number of underlying 
skills, and each is tested separately. The test might show, for example, that 
a particular child habitually misplaces the decimal point in long division,
has difficulty in carrying figures from one column of addition to another,
and fails to line up properly the steps in multiplication.

Appraisal of Diagnostic Achievement Tests. There is undoubtedly a
great need for diagnostic tests to help understand particular weaknesses
in school achievement. In order to have an adequate diagnostic test it is
essential, among other things, to obtain reliable score profiles. The meas-
urement of differential achievement is the object of the diagnostic battery.
Unfortunately, many of the diagnostic batteries employ subtests which
have low reliabilities and high correlations with one another. Conse- 
quently, score differences within the batteries are unreliable. In order to 
have the broad coverage of skills that is necessary for diagnostic purposes 
and in order to make each subtest long enough to be individually reliable, 
a thorough diagnostic test must necessarily be a long, time-consuming 
measure to apply. 
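
The cost of making each subtest long enough to be individually reliable can be estimated with the Spearman-Brown formula, which relates the reliability of a lengthened test to that of the original. In the sketch below the function names and the starting reliability of .60 are assumptions for illustration only, and the formula itself assumes that the added items are comparable to the existing ones.

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability when a test of reliability r is made k times as long."""
    return k * r / (1.0 + (k - 1.0) * r)


def lengthening_factor(r: float, target: float) -> float:
    """Factor by which a test must be lengthened to move from r to the target."""
    if not (0.0 < r < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1.0 - r) / (r * (1.0 - target))


# Hypothetical short diagnostic subtest with a reliability of .60.
k_needed = lengthening_factor(0.60, 0.90)
check = spearman_brown(0.60, k_needed)

print(f"lengthening factor needed: {k_needed:.1f}")
print(f"predicted reliability at that length: {check:.2f}")
```

With these assumed figures the subtest would have to be about six times as long to reach a reliability of .90, and a battery of many such subtests grows accordingly.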
CHAPTER 13


Opinions, Attitudes, and Interests 


Democracies stir and change in reaction to what the people think and 
feel. Politicians avidly inspect the current opinion poll results to learn 
the public sentiment toward arms reduction, farm subsidies, and tax 
policies. Commercial concerns spend millions to learn the kinds of house- 
hold goods which are desired by the average housewife. Teen-age whims 
make and break entertainers in rapid succession. Prejudice of one group 
toward another presents a barrier to the democratic ideal and a hundred 
years of rancor and protest result. What the individual wants to do in 
life determines in good measure the vocation he enters. As a partial conse- 
quence, the nation finds itself with a shortage of teachers, nurses, and 
engineers and a surplus of farmers, lawyers, and actors. In these and many 
other ways, opinions, attitudes, and interests are important factors in 
everyday living.

Opinions, attitudes, and interests are prominent influences in the indi- 
vidual’s social adjustment. If the husband likes Beethoven and the wife 
prefers Sinatra, if one is a Republican and the other a Democrat, if one 
is prejudiced against Negroes and the other is not, disagreement and 
hostility are hard to avoid. The individual’s family and friends have their 
own attitudes and opinions, and the choice between conformity and con- 
flict is difficult to make. Personalities can disrupt when interests do not 
match abilities or when strong attitudes conflict with one another. 

The four previous chapters were primarily concerned with human 
ability, with how much a person knows or how well he can solve prob- 
lems. This and the following chapters will be primarily concerned with 
personal reactions, or what are variously called social traits, noncognitive 
attributes, and habitual performance. These human characteristics are 
often included in the more general term “personality,” but it is helpful to 
distinguish attitudes, opinions, and interests from the kinds of personality 
attributes which will be discussed in later chapters. 

The devices which are used to study attitudes, opinions, and interests 
are often referred to as “tests,” but they are not tests in the same sense as
a vocabulary test or a mechanical comprehension test. In an aptitude test
the subject cannot control his score, except to cheat or to make a low
score on purpose. The most socially approved behavior is to make the 
highest score possible, and the individual's ability limits the score which 
is made. This is not at all the case with most measures of attitudes, 
opinions, and interests. The measurement devices are more properly 
referred to as questionnaires or inventories. They are standard procedures 
for obtaining information from the individual about himself. Most of them 
are dependent on the individual's knowledge of himself and his willing- 
ness to tell what he knows. 

The individual knows many things about himself which would be diffi-
cult to learn in any way other than asking him directly. However, the in-
dividual is not always an accurate observer of his own behavior nor a
willing reporter. This is particularly the case when the questions concern 
potentially embarrassing issues such as sexual adequacy, honesty, and 
personal courage. Depth psychology has taught us that the individual 
often hides unpleasant things from himself through the mechanisms of 
projection, repression, and rationalization. Even if the individual is aware 
of personal deficiencies, he is likely to hide this information from others. 
Emotionally laden issues predominate in the personality inventories and 
cause the subject to “fake” responses rather than give accurate answers. 
Inventories for opinions, attitudes, and interests are sometimes affected by 
the individual’s need to give socially acceptable responses but not nearly 
so much so as are the personality inventories. (The problem of “faking” 
will be discussed at greater length in the next chapter.)

Relationships Among Opinions, Attitudes, and Interests. There are no 
sharp dividing lines among the three types of instruments to be discussed 
in this chapter. They are all inventories dependent on the individual’s 
reporting of what he thinks and feels. However, it is necessary to distin- 
guish them as well as possible, because they are constructed and used 
somewhat differently.

It is easiest to begin by distinguishing interests from opinions and atti- 
tudes. Interest inventories are meant to measure what the individual does 
and does not like to do. A typical interest item is, “Do you like to work 
out of doors?” to which the subject answers either “yes” or “no.” Although 
interest inventories can be constructed for activities of all kinds, they are 
primarily concerned with activities that distinguish jobs and professions 
from one another. 

Opinions and attitudes concern how the individual reacts to the people, 
institutions, and ideas in the world about him. The term “opinions” is used
more often to refer to judgments and knowledge, whereas the term “atti-
tudes” is more connotative of feelings and preferences. Opinions are usu-
ally more verifiable than attitudes. That is, the opinion “There are more
Protestants than Catholics in the United States” is open to direct inquiry
and verification. Using the term “opinion” in the stricter sense to refer to 
verifiable statements, it is possible to determine whether or not an indi- 
vidual’s opinion is true or false. An example of an attitudinal statement 
is, “I dislike capital punishment.” This is a statement of feeling, and it 
does not make sense to inquire whether it is true or false. 

Although it is possible to distinguish opinions and attitudes with 
extreme examples, the two are usually mixed to some extent in most ques- 
tionnaire studies. For example, the statement “Negroes have lower intelli- 
gence than white people” is open to study and as such is an opinion; but 
it also relates to how an individual feels about Negroes and is conse- 
quently an attitudinal statement as well. Our feelings about most persons 
and ideas usually influence our judgments to some extent. Because of the 
overlap between the two kinds of reactions, the term “opinions” is often 
used to refer to both feelings and judgments, or the term “attitudes” is 
often used to cover the two. 

Aside from the psychological functions that are involved in studies of 
opinions and attitudes, the terms are usually employed to refer to differ- 
ent kinds of studies. The term “opinions” has become associated with the 
opinion poll, which involves the sampling of responses from a large and 
broad cross section of the population. The questions are usually very 
simple, such as, “Will you vote for Jones or for Smith?” and in most 
cases require only a “yes” or “no” answer. If two or more questions 
are employed, they are either unconnected or not connected in a logical 
manner so that they add together to form a total score. Respondents are 
interviewed on the street while they wait for the bus, or the housewife 
answers the questions while warming the baby’s bottle. 

Attitude-measurement devices are more like laboratory instruments. 
They often involve complex rating methods, and it is usually necessary to 
employ a dozen or more items. Consequently, attitude scales are seldom 
administered in mass to thousands of people, but are more often employed 
to study particular groups of people. Whereas the separate items in an 
opinion poll do not necessarily add up to a more general measure, this is 
the essence of the attitude-measurement techniques. The purpose in the 
study of attitudes is to locate the individual on a continuum, or scale,
which ranges from very unfavorable to very favorable in respect to the
object being rated.


OPINION POLLS 


Opinion polling, like most other things, appears quite simple to the
individual who makes his living doing something else. On the surface it
seems that all that is necessary is to ask some questions of a large number
of persons. In fact, the adequate measurement of opinions is an exacting
business; and unless the polling is undertaken with considerable care,
the results may be very misleading. Considering the number of ways in 
which opinion polls can go astray, it is surprising that the polling results 
have been as accurate as they have during the last thirty years. 

The opinion poll is an outgrowth of the “straw vote,” whose purpose 
is to sample voting sentiment before an election. One of the earliest 
recorded “straw votes” was that of the Harrisburg Pennsylvanian in 1824.
Reporters went out to inquire whether the citizens in the surrounding
region would vote for Henry Clay, Andrew Jackson, John Quincy Adams, 
or William H. Crawford for President. The straw vote grew steadily as 
an aspect of journalism and gradually extended to topics other than vot- 
ing behavior. By the early 1930s a number of magazines, particularly 
Literary Digest and Woman’s Home Companion, were involved in ex- 
tensive polling activities. 

A now famous mishap focused public interest on opinion polls and 
pointed out the need for more systematic methods. The Literary Digest 
made a very bad estimate of the voting sentiment in the 1936 presidential 
election (see 1, chap. 10). The magazine predicted a Landon victory 
with 370 electoral votes, but Roosevelt won 523 of a possible 531 votes. 
This calamitous mistake drove the magazine out of business. 

A look at how the Digest poll operated will point up some lessons about 
opinion polling. The poll was conducted entirely by mail, with 10 million 
ballots being sent out for the 1936 election, of which less than 2½ million
were returned. The voters who reply in this way are usually a select 
group, overweighted with persons from the upper-income and upper- 
educational levels. People who reply to mail questionnaires usually have a 
more direct interest in the election than do those who fail to reply. In 
this case it was the upper-income individuals who were protesting against 
Roosevelt. 

Worse than the dependence on mail-in ballots, the Digest used very 
poor sampling procedures. The mailing list was obtained from telephone 
directories and automobile registrations. The people in 1936 who had 
telephones and automobiles were, as a group, much higher on the socio- 
economic scale than those who did not. At that time the lower socioeco- 
nomic group voted heavily Democratic and the higher socioeconomic 
group voted heavily Republican. The Digest predicted that Roosevelt 
would get only 40.9 per cent of the popular vote, whereas he actually got 
60.2 per cent.
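
The effect of drawing a sample from an unrepresentative list can be imitated with a small simulation. Everything in the sketch below is hypothetical: an electorate of 100,000 voters is invented, 30 per cent of whom appear on the mailing list (standing in for telephone and automobile owners) and favor the candidate at a rate of 40 per cent, while the unlisted 70 per cent favor him at a rate of 70 per cent. These figures are assumptions made for the example, not estimates of the 1936 electorate.

```python
import random


def poll(frame: list[int], n: int, rng: random.Random) -> float:
    """Share favoring the candidate in a simple random sample of n from the frame."""
    sample = rng.sample(frame, n)
    return sum(sample) / n


rng = random.Random(1936)  # fixed seed so the run is repeatable

# 1 = favors the candidate, 0 = does not (all counts are hypothetical).
listed = [1] * 12_000 + [0] * 18_000      # 30,000 listed voters, 40% in favor
unlisted = [1] * 49_000 + [0] * 21_000    # 70,000 unlisted voters, 70% in favor
population = listed + unlisted

true_share = sum(population) / len(population)

# A very large sample drawn only from the list, versus a small sample
# drawn from the whole electorate.
biased_estimate = poll(listed, 20_000, rng)
unbiased_estimate = poll(population, 1_000, rng)

print(f"true share in electorate:       {true_share:.3f}")
print(f"20,000 drawn from the list:     {biased_estimate:.3f}")
print(f"1,000 drawn from everyone:      {unbiased_estimate:.3f}")
```

The large sample taken from the list settles near the list's own 40 per cent rather than the electorate's 61 per cent, whereas the much smaller sample taken from the whole electorate lands far closer to the true figure. Adding more ballots from the same list does not remove the error.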

The Literary Digest fiasco taught pollers a lesson that they will never 
forget. A careful look at the Digest polling results for the years before 
1936 would have shown the unrepresentativeness of the results. In the 
election predictions for the years of 1916, 1920, 1924, and 1928, the polls
showed average plurality errors of 20, 21, 12, and 12 per cent respectively
(1, p. 180). 

The first systematic polling organization in this country began in 1935, 
when George Gallup formed the American Institute of Public Opinion.
Knowing the nature of the Digest sample, he was able to predict very
closely the amount by which their results would be in error before the
ballots were even mailed out. Gallup’s organization and others have grown
in strength since that time. Other leading polling agencies are the Fortune
Survey (Elmo Roper), the Crossley Poll, the National Opinion Research 
Center, and the Office of Public Opinion Research (Princeton Univer- 
sity). 

During the last twenty years the polling agencies have grown in size 
and influence and have branched out into many areas besides the pre- 
diction of elections. Polling activities increased sharply during the 
Second World War as a means of keeping a constant ear on public 
opinion. This provided the government with a considerable amount of 
information about the sentiment for joining the war, reaction to price 
controls, faith in our allies, and many other pertinent issues.


TABLE 13-1. PREDICTIONS OF VOTING PERCENTAGES FOR WINNING CANDIDATES
COMPARED WITH THE NATIONAL VOTE

(Adapted from Albig, 1, p. 212)

                                              Election
Poll                       1936*        1940*        1944†       1948†       1952‡
                        (Roosevelt)  (Roosevelt)  (Roosevelt)  (Truman)  (Eisenhower)

Literary Digest            40.9          --           --          --          --
Fortune (Roper)            61.7         55.2         53.6        41.5        57.0
American Institute of
  Public Opinion (Gallup)  53.8         52.0         51.5        47.3        54.0
Crossley Poll              53.8         52.9         52.0        47.3        52.8
National vote              60.2         54.7         53.8        52.3        55.3

* Percentages of total vote including that for minor candidates.
† Percentages of two-party vote only.
‡ Percentages of two-party vote with proportional division of “undecided” votes.
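
The size of each poll's error can be worked out directly from Table 13-1. The sketch below takes the percentages as they are printed in the table and subtracts the national vote from each prediction; the variable and function names are illustrative assumptions. Because the table's footnotes show that the percentage base differs from one election to another, the errors are best compared within a single election year.

```python
# Percentages for the winning candidate, as printed in Table 13-1.
national = {1936: 60.2, 1940: 54.7, 1944: 53.8, 1948: 52.3, 1952: 55.3}

polls = {
    "Literary Digest": {1936: 40.9},
    "Fortune (Roper)": {1936: 61.7, 1940: 55.2, 1944: 53.6, 1948: 41.5, 1952: 57.0},
    "Gallup": {1936: 53.8, 1940: 52.0, 1944: 51.5, 1948: 47.3, 1952: 54.0},
    "Crossley": {1936: 53.8, 1940: 52.9, 1944: 52.0, 1948: 47.3, 1952: 52.8},
}


def errors(predictions: dict[int, float]) -> dict[int, float]:
    """Prediction minus national vote, in percentage points, for each year."""
    return {year: round(pred - national[year], 1) for year, pred in predictions.items()}


def mean_abs_error(predictions: dict[int, float]) -> float:
    """Average size of the error, ignoring its direction."""
    errs = errors(predictions)
    return sum(abs(e) for e in errs.values()) / len(errs)


for name, predictions in polls.items():
    print(f"{name:16s} errors: {errors(predictions)}  "
          f"mean absolute error: {mean_abs_error(predictions):.2f}")
```

On these figures the Digest's single 1936 prediction misses by about 19 percentage points, while the largest miss among the other three polls is Roper's 1948 figure, about 11 points below the national vote.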


Market research is the major activity of polling agencies today. Com- 
mercial firms spend millions of dollars to learn what the public thinks 
about new products and what type of person is most likely to buy an 
article. Special polls such as Hooper, Nielsen, and Trendex report the
audience reaction to radio and TV programs, and the results have a
tremendous influence on the pattern of entertainment in America. Opinion
polling has also become a new type of journalism. Polls are conducted
solely to stir the public interest, using questions like, “Should the wife
or the husband manage the family finances?” and “Would you vote for
a woman President?” Poll results now appear regularly in hundreds of
American newspapers. The election predictions have remained the showcase
of opinion polling, and the predictions have increased in accuracy
(see Table 13-1 for some election predictions).

Population Sampling for Opinion Polls. The most accurate way to
determine public opinion is to ask questions of every individual in the 
population. A complete census of this kind is extremely laborious and 
expensive. The 1940 United States government census cost $50,000,000, 
and more recent census studies were even more expensive. Consequently,
it is necessary to deal with samples of the population as a way of estimat- 
ing how the total public reacts to an issue. 

The results of an opinion poll based on a sample of the population are
useful only to the extent that they closely resemble the results that would
be obtained from the whole population in question. This will be the case
when the sampling is precise and accurate, in the senses in which the
terms will be used here. A sampling procedure is accurate if it is unbiased:
i.e., if the sample is not overbalanced with persons of one kind
or another. If 20 per cent of the individuals in the nation hold an opinion,
the results of an accurate sampling procedure will, as the number of
individuals polled grows larger and larger, approach the 20 per cent
figure. An example of a biased, or inaccurate, sampling procedure was
given on the previous pages in respect to the 1936 election predictions
by the Literary Digest. If they had gathered larger and larger numbers
of persons from telephone directories and automobile registrations, they
would still have ended up with the wrong prediction.

The precision of a sample is determined directly by the number of 
persons sampled. If the sampling procedure is accurate, the true popula- 
tion results will be approximated more and more closely as the number 
of individuals in the sample is increased. Statistical formulas can be used 
to determine the likelihood that the population results will differ from 
the sample results by particular amounts (the formulas are discussed in 
Appendix 6). For example, if in an unbiased (accurate) sample of 1,000
persons, 55 per cent say that they would vote for candidate A, the odds 
are less than 1 in 1,000 that fewer than 50 per cent of the entire popula- 
tion would, if polled, say that they would vote for A. If we could estimate 
things in our daily life, such as whether or not it will rain tomorrow, with 
odds of only 1 in 1,000 of being wrong, we would feel very comfortable 
in making decisions.
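
The figure of less than 1 in 1,000 can be checked with a short calculation. The sketch below is only an illustration: it assumes the usual normal approximation to the binomial, which may differ in detail from the formulas of Appendix 6, and the function name is hypothetical.

```python
import math

def prob_population_below(threshold, sample_share, n):
    """Normal-approximation chance that the population share lies below
    `threshold`, given an unbiased sample of n persons showing `sample_share`."""
    # Standard error of a proportion (assumed formula: sqrt(p * q / n)).
    se = math.sqrt(sample_share * (1.0 - sample_share) / n)
    z = (sample_share - threshold) / se
    # Upper-tail area of the standard normal curve beyond z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# The example from the text: 55 per cent for candidate A in a sample of 1,000.
p = prob_population_below(0.50, 0.55, 1000)
print(f"standard error = {math.sqrt(0.55 * 0.45 / 1000):.4f}")
print(f"chance that true support is under 50 per cent = {p:.5f}")
```

Under these assumptions the chance comes out a little under 1 in 1,000, in agreement with the statement above.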

The cost of polling tends to go up arithmetically with the number
sampled. That is, if it costs $2,000 to poll 400 people, it will cost about
$4,000 to poll 800 people. The precision of the sample increases in proportion
to the square root of the number of persons polled. That is, it
takes a sample of 1,600 persons to double the precision of a 400-person
sample. There is a point of diminishing returns, where the increasing
precision becomes unimportant in comparison to the increasing cost.
This is why most large-scale polls use samples that generally range between
3,000 and 5,000 persons. With sample sizes this large, the effect of
imprecision alone is not likely to make the results different by more than
a percentage point or two from the actual population values.
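
The trade-off between cost and precision can be tabulated in a few lines. This is a rough sketch rather than a polling agency's actual method: the cost per interview is taken from the $2,000 for 400 persons example, and the margin of error assumes a 95 per cent normal-approximation interval for a proportion near 50 per cent. Both function names are hypothetical.

```python
import math

COST_PER_INTERVIEW = 5.00  # dollars; $2,000 / 400 persons, from the text's example

def margin_of_error(n, share=0.5, z=1.96):
    """Approximate 95 per cent margin of error, in percentage points
    (assumed normal approximation; the worst case is share = 0.5)."""
    return 100.0 * z * math.sqrt(share * (1.0 - share) / n)

def polling_cost(n):
    """Cost rises in direct proportion to the number of persons polled."""
    return COST_PER_INTERVIEW * n

for n in (400, 1600, 3000, 5000):
    print(f"n = {n:5d}   cost = ${polling_cost(n):9,.0f}   "
          f"margin = {margin_of_error(n):.1f} points")
```

Quadrupling the sample from 400 to 1,600 quadruples the cost but only halves the margin of error, which is the point of diminishing returns described above.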

The erroneous results sometimes obtained in opinion polls are seldom 
due to lack of precision (sample size). The Literary Digest in the election of
1936 received over 2 million straw votes. Precision is a necessary starting
point for an adequate sample, and except for a few studies which are 
conducted on less than 100 persons, most polling results have high pre- 
cision. But even if the sample is highly precise, the sample may be 
biased and provide misleading results. 

Sampling Techniques. Sampling procedures were discussed briefly in 
Chapter 3 in regard to the gathering of testing norms. Here the subject 
will be treated in more detail because of the direct importance of sam- 
pling techniques for opinion polls (see 20 for a thorough discussion of 
sampling methods). The population to be sampled in opinion polls may 
be, as the term is popularly intended, the entire adult United States 
citizenry, or it may be limited to the population of a certain state or 
locality. Sometimes the population being sampled consists of all the 
members of a particular group, such as all of the members of a particular 
labor union, or all of the Catholics in the United States. 

The most obvious and in many ways the ideal method of sampling is 
to draw a number of persons completely at random from the population. 
If, for example, a sample of 1,000 persons from a labor union is being 
studied, the individuals could be randomly chosen from union rolls. 
(Tables of random numbers which can be found in many statistics books 
are useful for this purpose.) Random sampling is a completely accurate, 
or unbiased, procedure.

If there are obvious subgroups in the population such as plumbers, 
carpenters, and machinists, there is a way of improving the random 
sample. This can be done by ensuring that the subgroups appear in the 
sample in proportion to their numbers in the population. That is, if 20 
per cent of the union members are plumbers, it should be ensured that 
20 per cent of the sample is composed of plumbers. Each of the subgroups 
—plumbers, carpenters, and so on—is referred to as a stratum. Ideally the 
proper number of persons should be chosen randomly within each
stratum. That is, in the example above, the specified number of plumbers
could be randomly drawn from lists of plumbers only. A sample obtained
in this way is called a stratified-random sample.
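
The stratified-random procedure can be sketched as follows. The union rolls, their sizes, and the function name are all hypothetical, chosen only so that plumbers make up 20 per cent of the membership as in the example above.

```python
import random

def stratified_random_sample(rolls, sample_size, seed=None):
    """rolls maps each stratum name to a list of member identifiers.
    Returns a dict mapping stratum name to randomly drawn members, with
    each stratum represented in proportion to its size in the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in rolls.values())
    sample = {}
    for stratum, members in rolls.items():
        # Proportional allocation; rounding can leave the total off by a
        # person or two when the shares do not divide evenly.
        quota = round(sample_size * len(members) / total)
        sample[stratum] = rng.sample(members, quota)
    return sample

# Hypothetical rolls: 20 per cent plumbers, 50 per cent carpenters,
# 30 per cent machinists.
rolls = {
    "plumbers": [f"P{i}" for i in range(2000)],
    "carpenters": [f"C{i}" for i in range(5000)],
    "machinists": [f"M{i}" for i in range(3000)],
}
sample = stratified_random_sample(rolls, 1000, seed=1)
for stratum, drawn in sample.items():
    print(stratum, len(drawn))
```

A plain random sample of 1,000 would contain about 200 plumbers only on the average; the stratified draw fixes the number at 200 while still leaving the choice of individuals to chance.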


The random and the stratified-random are the most accurate types of 
sampling available for general use. However, they are too expensive to 
undertake in many situations, particularly in national polling. A satis- 
factory approximation can be obtained by an area sample. In this case, 
dwellings are sampled instead of persons. A city can be laid out into a 
large number of square areas. A number of the areas can be drawn at 
random, and a number of addresses can be randomly chosen within each 
area. Opinion pollers can then contact as many of the designated persons as
can be found and are willing to answer the questions. The area sample is ideal for
measuring the opinions of the individuals in one city. If the sample is to
represent a regional or national population, there is the additional chore
of choosing representative towns, cities, and rural areas to study.
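
The two stages of the area sample, drawing areas at random and then drawing addresses within each chosen area, can be sketched as below. The grid of 100 square areas with 50 addresses apiece is an invented city, and the function name is hypothetical.

```python
import random

def area_sample(addresses_by_area, n_areas, n_per_area, seed=None):
    """Stage 1: draw areas at random. Stage 2: draw addresses at random
    within each chosen area. Returns a dict of area -> chosen addresses."""
    rng = random.Random(seed)
    chosen_areas = rng.sample(sorted(addresses_by_area), n_areas)
    return {
        area: rng.sample(addresses_by_area[area],
                         min(n_per_area, len(addresses_by_area[area])))
        for area in chosen_areas
    }

# Hypothetical city laid out as a 10-by-10 grid of square areas.
city = {
    (row, col): [f"{row}-{col}-{k}" for k in range(50)]
    for row in range(10)
    for col in range(10)
}
sample = area_sample(city, n_areas=20, n_per_area=5, seed=7)
print(len(sample), "areas,", sum(len(a) for a in sample.values()), "addresses")
```

Only a list of areas and the addresses inside the chosen areas are needed, not a roll of every person in the city, which is where the saving over the random and stratified-random samples comes from.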

Even the area sample is too difficult to obtain in many studies, particu- 
larly in the weekly polls on current issues. It is necessary to resort instead 
to what is called a quota sample. The quota sample makes no effort to 
select people randomly. Instead, persons are deliberately chosen because 
of their individual characteristics. The effort in the quota sample is to 
construct a miniature of the entire population in terms of known demo- 
graphic characteristics. The characteristics which are most often con- 
trolled are geographic region, race, urban-rural, age, sex, and income. 
The individual interviewer, located in, say, Memphis, Tennessee, is told 
the number of people that he should poll in each of the above categories. 
He must, for example, ensure that about 10 per cent of his sample is
Negro and the remainder white, that 30 per cent of the subjects are from
rural districts, that half are above forty years of age, and so on for the
other demographic characteristics. Little effort is made to balance out 
the cross categorizations of persons, so as to ensure that half of the urban 
subjects are women, and so on for more complex combinations. Some of 
the categorizations depend on the interviewer's judgment to an extent. 
Even though the distributions of demographic characteristics for the in- 
dividual interviewer may only roughly match the percentages of such 
persons in the United States population, these inconsistencies tend to 
average out as the results are combined for many different interviewers. 
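
The averaging-out of individual interviewers' departures from the quota can be illustrated with a small simulation. It is only a sketch under a strong assumption, stated in the code: each interviewer misses the 30 per cent rural quota by chance alone. A bias shared by all interviewers would not cancel in this way. The function name and the numbers of interviewers and respondents are hypothetical.

```python
import random

def pooled_vs_individual_error(target, n_interviewers, per_interviewer, seed=0):
    """Return (mean deviation of single interviewers from the quota,
    deviation of the pooled sample from the quota)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_interviewers):
        # Assumption: each respondent falls in the category with probability
        # equal to the quota, so interviewers miss it only by chance.
        counts.append(sum(rng.random() < target for _ in range(per_interviewer)))
    individual = sum(abs(c / per_interviewer - target) for c in counts) / n_interviewers
    pooled = abs(sum(counts) / (n_interviewers * per_interviewer) - target)
    return individual, pooled

individual, pooled = pooled_vs_individual_error(0.30, n_interviewers=200,
                                                per_interviewer=20)
print(f"average miss for one interviewer: {100 * individual:.1f} points")
print(f"miss for all interviewers combined: {100 * pooled:.1f} points")
```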

Spot-check Samples. In many studies it is desirable to get an indication 
of national opinion without going to the time and expense of a large- 
scale national poll. Sometimes used for this purpose are small local 
samples, which are selected in such a way as to share approximately the 
same demographic characteristics as the total population except for the 
restriction on geographic region. Cantril (4, chap. 12) reports a series 
of studies in which the results from small local samples are compared 
with national polls. For example, in one study the local sample consisted
of 264 persons all of whom were obtained within 40 miles of Princeton
University. The interviewers were instructed to choose persons in the
upper-, middle-, and lower-income brackets in the ratios 1:4:5. No systematic
effort was made to balance other demographic characteristics.
The same questions that were asked of the local sample were incorporated
in a large-scale national poll. Very similar results were obtained
from the local sample and the national poll (see Table 13-2 for illustrative
results from one of the questions). However, as Cantril warns, the
representativeness of results from small local samples depends very much
on the questions which are asked. If the questions are related to distinctly
local issues and feelings, the results are likely to be quite misleading.
Such would be the case if opinions about segregation are sampled in
Mississippi only or if opinions about international affairs are sampled
from persons in Wisconsin only.


TABLE 13-2. COMPARISON OF THE RESULTS OBTAINED FROM AN OPINION
QUESTION ADMINISTERED TO A SMALL LOCAL SAMPLE AND TO A LARGE
NATIONAL SAMPLE

(Adapted from Cantril, 4, p. 158)

Question: If President Roosevelt made a radio talk explaining that it would be necessary
to ration gasoline to reduce everybody's driving by as much as one-third
from what is being driven now in order to save rubber, would you be
willing to see this done?

                      Small New Jersey sample,    National poll,
                      per cent                    per cent
Yes                            88                      87
No                              8                       8
Qualified answer                4                       1
No opinion                      0                       4


Opinion Panels. One of the larger expenses in opinion polling is the 
gathering of a new sample of persons for each questionnaire which is 
used. A practice which has developed in recent years is that of forming 
a panel, a representative sample of persons who are periodically inter- 
viewed or mailed questionnaires. There are some distinct practical 
advantages to using a panel and also some likely risks. Once the panel 
is formed, the gathering of opinions is much less expensive than if new 
subjects have to be obtained for each study. Because the panel will be 
used over and over, it is not unduly expensive to obtain a more repre- 
sentative (accurate) sample than is generally possible otherwise. Al- 
though it is difficult to handle opinion studies by mail the first time 
individuals participate, the members of a panel soon learn enough about 
the use of questionnaires so that studies can be handled entirely by mail.
There is a great saving in the use of this method over the use of many
personal interviewers.


The principal danger in using a panel is that the continued experience
with questionnaires is likely to change people’s viewpoints or at least
change what they say about their viewpoints. This is more likely to
happen when the majority of the questions are on closely related topics,
and particularly when the questions concern the knowledge of issues.
For example, if a series of studies is made on popular opinions about
health, the panel members are likely to think more about health problems
than formerly, read material relating to health issues, and be more aware
of health information in the mass media of communications. Consequently,
the panel members will tend to become more knowledgeable
than the average person and perhaps also will change their feelings about 
many of the issues. Although the panel may have originally been a 
representative sample of persons, the group can become unrepresentative 
through the very act of participating in a series of studies. 

Not a great deal is known at the present time about how panels change
over time. The practical advantages of using a panel merit more research 
on this problem. It is safer to use a panel when successive questionnaires 
concern relatively unrelated topics. Also, it is good to retire periodically 
some of the panel members and sample new persons to replace them. In
this way there is a gradual turnover in panel membership and, conse- 
quently, less likelihood of building up an unrepresentative sample of 
respondents. 


PRINCIPLES OF OPINION POLLING 


Even if the requirements of an accurate and precise sample are met, 
there are still many ways in which opinion polls can go wrong. Careful 
attention must be paid to the wording of questions, the way in which 
opinions are obtained, and the biases of the interviewer. 

Interviewer Bias. The interviewer is supposedly an impartial recorder 
of opinions who should have no influence on the responses which are 
obtained. A series of studies (see 4, chap. 8) shows that this is far from 
the case. It has been found that interviewers who hold a particular 
opinion themselves tend to find more members of the general public with 
the same opinion. In one study the following question was asked of 
interviewers (4, p. 108):

“Which of these two things do you think is more important for the 
United States to try to do— 

To keep out of war ourselves, or
To help England win, even at the risk of getting into war ourselves?”

After the interviewers’ own opinions were obtained, the interviewers
asked the same question of a national sample. Comparisons were then
made of the results obtained by interviewers favoring keeping out of the
war and favoring helping England. The results are shown in Table 13-3.


TABLE 13-3. COMPARISON OF INTERVIEWERS’ AND RESPONDENTS’ OPINIONS ON
THE SAME QUESTION

(Adapted from Cantril, 4, p. 109)

Interviewer favored      Opinions of respondents        Per cent
                         reported by interviewer
Helping England          Favored helping England           60
                         Favored keeping out               40
Keeping out              Favored helping England           44
                         Favored keeping out               56


There is a striking difference in the responses obtained by the two groups 
of interviewers, showing that interviewers themselves affect the responses 
obtained from the public. A breakdown of the differences in terms of size 
of city showed that the largest bias occurred in small towns and in rural 
areas. No difference between the two groups of interviewers occurred in 
cities with over 100,000 persons. Because interviewers in small towns 
often know many of the inhabitants, it is likely that they tend to pick 
respondents who hold opinions similar to their own. Another possibility 
is that the townspeople know the interviewer and his general opinions 
and that they give similar opinions in order to please the interviewer. 
Interviewers with different opinions on unions, political candidates, and 
other topics tend to get different responses in polling. 

Whatever the subtle factors working in the relationship between the
interviewer’s opinions and the responses which he obtains, this form of 
bias should be controlled as much as possible. Knowing that bias of this 
kind occurs more strongly in small towns, it is wise to use an outsider to 
sample opinions rather than a local interviewer. Also, interviewers should 
be trained to discount their own opinions in the sampling and interview- 
ing of subjects in so far as that is possible. One interesting control is 
that used by Gallup in the polling of election sentiment. He employs half 
Republican and half Democratic interviewers to help balance out the 
effect of the interviewers’ own opinions. 

The type of person the interviewer is can bias responses as much as 
the opinions which he holds. This occurs most prominently when the 
interviewer is of a different race or ethnic group than that of the 
respondent. Cantril (4, pp. 114-115) reports a study in which Negro and
white interviewers asked the same questions of Negro respondents. The
results are shown in Table 13-4. The results show that strikingly different
opinions are obtained by the two kinds of interviewers. In this case it is
reasonable to think that the Negro interviewers were obtaining more 
frank opinions. It would be expected that large differences in responses 
like those shown in Table 13-4 would occur only on questions relating 
to race or ethnic relations. 

TABLE 13-4. RESPONSES OBTAINED FROM NEGRO RESPONDENTS BY WHITE
AND NEGRO INTERVIEWERS

(Adapted from Cantril, 4, p. 116)

Question                       Negro interviewers       White interviewers
                               reported, per cent       reported, per cent

Would Negroes be treated       Better           9       Better           2
better or worse here if        Same            32       Same            20
the Japanese conquered         Worse           25       Worse           45
the U.S.A.?                    No opinion      34       No opinion      33

Would Negroes be treated       Better           3       Better           1
better or worse here if        Same            22       Same            15
the Nazis conquered the        Worse           45       Worse           60
U.S.A.?                        No opinion      30       No opinion      24

Do you think it is more        Beat Axis       39       Beat Axis       62
important to concentrate       Make democ-              Make democ-
on beating the Axis, or        racy work       35       racy work       26
to make democracy work         No opinion      26       No opinion      12
better here at home?


Interview Bias. In addition to the bias which is caused by the inter- 
viewer, the interview situation itself can distort responses. The interview 
is a social situation where at least the interviewer is present to hear the 
respondent’s opinions. When interviews are conducted on the street,
there may be one or several friends of the respondent who hear the
responses. People tend to express more socially acceptable opinions when
they are in the presence of other individuals than when the opinion
questionnaire is filled out in privacy. Cantril (4, chap. 5) reports a comparison
of the results obtained from a secret response with those from an
interview. One sample of subjects was asked the questions in the usual
interview situation. A matched sample filled out the questionnaire by
themselves in their own homes. Results from one question are shown in
Table 13-5. The most socially acceptable response at the time was to
favor helping England as much as possible, and consequently, the majority
of the respondents gave that opinion in the interview situation. When
respondents were allowed the privacy of their own homes, they were
more inclined to express a less socially acceptable opinion. Interview
bias is more likely to occur when there is strong social pressure about
an issue. In the example above, the government, most arms of the mass
media of communications, as well as many organizations were backing
aid for England.


TABLE 13-5. COMPARISON OF RESULTS OBTAINED FROM AN INTERVIEW AND
FROM A SECRET RESPONSE TO AN OPINION QUESTION

(Adapted from Cantril, 4, p. 79)

Question: Do you think that the English will try to get us to do most of the fighting
for them in this war, or do you think they will do their fair share of the
fighting?

                   Will do their fair share    Will try to get us to do    Don’t know,
                   of the fighting, per cent   their fighting, per cent    per cent
Interview                     57                          25                   18
Secret response               42                          42                   16
Difference                    15                          17                    2


No-opinion Respondents. One of the difficulties that besets the opinion 
poller is that of interpreting the results from respondents who refuse to 
answer questions, have no opinion, or who say they have not made up 
their minds. When a sizable number of the respondents, say, over 20 per 
cent, express no opinion, it makes the over-all results somewhat question- 
able. The customary procedure is to divide the no-opinion response 
proportionally to the responses of those persons who do express an 
opinion. For example, if in an election poll, 45 per cent of the people 
say they will vote for candidate A, 40 per cent say they will vote for 
candidate B, and the remaining 15 per cent say that they have not made 
up their minds, the 15 per cent would be divided proportionally. This 
would give an estimated vote of 52.9 per cent for A and 47.1 per cent 
for B. In spite of the arguments that have been made against the pro- 
portional division of no-opinion respondents, the method has often worked 
well in practice. 
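
The proportional division in the example can be written out as a short calculation. The function name is hypothetical; the arithmetic is simply each candidate's share divided by the total share of respondents who expressed an opinion.

```python
def divide_no_opinion(shares):
    """shares maps each candidate to the per cent of respondents naming him.
    The no-opinion remainder is split in proportion to those shares, which
    amounts to rescaling the decided shares so that they total 100."""
    decided = sum(shares.values())
    return {name: 100.0 * pct / decided for name, pct in shares.items()}

# The example from the text: 45 per cent for A, 40 per cent for B,
# and 15 per cent undecided.
estimate = divide_no_opinion({"A": 45, "B": 40})
print({name: round(pct, 1) for name, pct in estimate.items()})
# prints {'A': 52.9, 'B': 47.1}
```

Candidate A receives 45/85 of the total and candidate B receives 40/85, which gives the 52.9 and 47.1 per cent figures stated above.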

Wording of Questions. There are usually a number of different ways to 
ask a question, and the responses will often be affected by the wording 
which is chosen. The first rule in opinion polling is to use the simplest 
terms possible. Otherwise, a large segment of the public will not under- 
stand the question. 

Prestige names and concepts should be avoided in questions unless 
they are pertinent to the issues. Table 13-6 compares the results which
were obtained from asking a question using Roosevelt’s name with the
same question asked without identifying the issue with Roosevelt (4, p.
39). Emotionally toned words such as Communist, radical, and fascist 
will also affect responses. 


TABLE 13-6. COMPARISON OF THE RESPONSES TO AN OPINION QUESTION USING
ROOSEVELT’S NAME AND WITHOUT HIS NAME

(Adapted from Cantril, 4, p. 40)

Question, Form A: It has been said recently that in order to keep the Germans out
of North and South America we must prevent them from capturing
islands off the west coast of Africa. Do you think that we
should try to keep the Germans out of the islands off the west
coast of Africa?

Question, Form B: President Roosevelt said recently that in order to keep the Germans
out of North and South America we must prevent them
from capturing the islands off the west coast of Africa. Do you
think that we should try to keep the Germans out of the islands
off the west coast of Africa?

                 With Roosevelt’s name,    Without Roosevelt’s name,
                 per cent                  per cent
Yes                       56                         50
No                        24                         21
No opinion                20                         29


Care must be taken not to word a question in such a way that several
different interpretations are possible. Respondents (4, p. 6) were asked 
the question “If the German Army overthrew Hitler and then offered to 
stop the war and discuss peace terms with the Allies, would you favor 
or oppose accepting the offer of the German Army?” A further inquiry 
into how people interpreted the question showed that the meaning varied 
widely for different individuals. Some persons even thought they were 
being asked whether or not they were in favor of Hitler being over- 
thrown. 

Summary of Principles. The previous sections have discussed only the 
most prominent ways in which opinion polls can go wrong (for more 
detailed discussions, see 1 and 4). The intention here is not to cast doubt 
on all opinion polling. Much of the research has been of a high quality, 
and the results have proved valid on many occasions. The lesson to be 
learned is that the opinion poll should be conducted by experts, and that 
little faith should be placed in haphazard polling attempts. Although 
much is now known about different kinds of biases that enter into 
opinion polls and ways to correct them, there will always be unknown 
sources of bias to distort particular results. Only by repeated studies with
different interview techniques, sampling procedures, and wording of
questions can exact measures of public opinion be obtained.


CHARACTERISTICS OF OPINIONS 


The results of an opinion poll provide only a surface indication of 
people's thoughts. Much more needs to be known about opinions: how 
they develop, their stability, the intensity with which they are held, and 
their relationship to overt behavior. The practical and directly profitable
task of measuring opinions has received the widest attention by polling
agencies. Only during the last twenty years have inquiries been undertaken
into the psychological background of opinions.

Reliability. Opinion questions are open to measurement error just as
all test items are. If the respondent is asked the same question on two
occasions, he might give different answers. This may reflect a real change
in opinion or only an element of chance in the way in which he responds.
If the respondent knows little about the issue or has no firm conviction,
his response might be little more than a mental coin flip. Unreliability 
can come from interviewers who interpret responses differently, acci- 
dentally record a “yes” when the respondent says “no,” or mix up the 
answers to different questions. Statistical errors in recording and ana- 
lyzing the data can add additional unreliability. 

Because each question in an opinion poll is usually analyzed separately, 
rather than added with others as is done on most tests, the reliability 
of the results is dependent on the separate questions. Only a small amount 
of research has been done on the reliability of opinion questions. The 
available data indicates fair to good reliability for different questions, 
with per cents of agreement often above 80 for repeated interviews with 
the same questions (see 4, chap. 7). 

The unreliability in opinion questions tends to reduce the difference 
between response categories. That is, if a question is to be answered 
either yes or no, whatever unreliability is present will bring the over-all 
percentages of yes’s and of no’s closer to the 50 per cent mark. If a ques- 
tion were completely unreliable, the expectation is that exactly 50 per 
cent of the answers would fall in each of the two categories. In the same 
way, any unreliability in opinion questions will reduce the difference 
between the responses of any two groups, such as in a comparison of 
labor and management opinions. Consequently, it can be expected that 
the raw results from an opinion poll are conservative estimates of the 
differences between response alternatives and between the responses by 
different groups of people. 
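
The pull toward the 50 per cent mark can be illustrated with a simple model. The model is an assumption made for this sketch, not a formula from the research cited: a fraction of the respondents answer according to their true opinion, and the rest answer by what amounts to a coin flip. The function name and the labor and management figures are hypothetical.

```python
def observed_yes_share(true_share, reliable_fraction):
    """Expected share of yes answers when `reliable_fraction` of respondents
    report their true opinion and the remainder answer at random (50-50)."""
    return reliable_fraction * true_share + (1.0 - reliable_fraction) * 0.5

# Hypothetical true opinions: 80 per cent of labor and 60 per cent of
# management favor a proposal, a true difference of 20 points.
for reliable in (1.0, 0.75, 0.5, 0.0):
    labor = observed_yes_share(0.80, reliable)
    management = observed_yes_share(0.60, reliable)
    print(f"reliable fraction {reliable:.2f}: labor {100 * labor:.1f}, "
          f"management {100 * management:.1f}, "
          f"difference {100 * (labor - management):.1f}")
```

Under this model every observed percentage moves toward 50 and every group difference shrinks by the reliable fraction, so the raw poll results understate the true differences, as the text concludes.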

Stability. Even though an opinion might be reliably measured at one 
point in time, it may change over a short period of time. This seems to
have occurred in the 1948 Presidential contest between Truman and
Dewey. The voting sentiment apparently shifted to Truman in the last
few weeks before the election. Partly because the polling agencies did
not sample opinion during the period immediately before the election,
they mispredicted and lost considerable public confidence for several
years thereafter.

Large shifts in opinion are more likely to occur on issues which are 
being discussed widely in the mass media of communications. Opinions 
are also more likely to change when the individual has no first-hand 
experience with the issue and must rely on secondary sources of informa- 
tion. Opinions generally change more easily on issues which do not affect 
the individual in an immediate personal way. One critical incident can 
often have a marked effect on public opinion, such as the Japanese attack 
on Pearl Harbor. Public opinion often changes markedly when a highly 
respected person advocates a particular viewpoint. 

Knowledge of Issues. Two individuals who give the same response to 
an opinion question may do so with very different amounts of knowledge 
about the issues. An individual may agree that it would be a good idea 
for the government to build and operate a dam on the Snake River but
have no idea where the Snake River is and no conception of the consequences
of the government action. In many cases, an individual's opinion
may be based on incorrect information rather than lack of information. 
Whenever the polling question is at all complicated, it is useful to give a 
short information quiz about the issues along with the usual polling ques- 
tions. It is sometimes found in this way that the more informed people 
have different opinions from the uninformed people, which leads in turn 
to suggestions as to how certain segments of the public can be educated 
about an issue. If the opinion poll results are being used in an advisory 
way to help in forming the policy of an institution, more faith can usually 
be placed in the sagacity of the opinions of people who know something 
about the issues. 

Strength of Conviction. Another dimension of opinions is the strength 
with which viewpoints are held. One person may say “yes” to the ques- 
tion “Should we help rearm Germany” and feel very strongly about it, 
very sure that it is the proper thing to do. Another individual may also 
answer “yes” but do so with considerable indecision and lack of sureness. 
Respondents are often asked to rate the strength of their conviction after 
responding to a polling question. One procedure is to ask the individual 
to endorse one of several statements which best describes the strength 
of his conviction, such as the following (see 4, p. 57): 

I am not at all strongly convinced on this matter.
I guess it is the best thing to do.
I am absolutely convinced this is what ought to be done.

It is important to know the strength of a respondent’s conviction for
several reasons. When the strength of conviction is low, the individual is
more open to outside influence and more likely to change his viewpoint.
People who hold beliefs strongly usually exert more influence in social
groups and in civic affairs than individuals whose beliefs are less strong.
A hard-core minority of people who strongly hold an opinion can often
win over the less certain larger group.

Validity. The validity of opinion polls depends on the way in which the 
results are used. Many studies intend only to gauge the public sentiment, 
to discover what people say about issues. The opinion poll is then valid 
if it is free from major sources of bias like those mentioned previously in 
this chapter. Sometimes the results of opinion polls are used to predict 
how individuals will behave later. Election polls intend to predict how
individuals will vote, and most market research studies intend to predict
what individuals will buy. In these cases, the predictive validity of
opinion poll results can be measured in terms of how well they actually
forecast future behavior. Experience has shown that the carefully executed
opinion poll can often predict future behavior with a high degree of
accuracy.

The predictive validity of opinion polls depends very much on the issues 
involved. If there are strong social pressures to hold one viewpoint or 
preference over others, the results of an opinion poll may serve as a poor 
predictor of how people will actually behave. A case in point is the story, 
probably apocryphal, concerning the study of preferences for new auto- 
mobiles conducted at the end of the Second World War. Most people said 
that they favored utilitarian, economical, low horsepower, inexpensive, 
unadorned automobiles. Being suspicious of the results, the polling agency 
undertook another poll in which each person was asked, “What type of 
automobile does the average person want?” The dominant picture ob- 
tained in the second poll was of a chrome-plated, leather-upholstered, 
high horsepower, many-gadgeted palace on wheels. The results obtained
from the initial study would, as time has shown, have served very badly
in forecasting the type of automobile that people would actually purchase
during the following decade. This also illustrates the need to use a variety
of questions concerning an issue rather than to place complete faith in
one question only.


ATTITUDE SCALES 


Attitudes are predispositions to react negatively or positively in some 
degree toward an object, institution, or class of persons. What negative 
and positive reactions are depends in part on the particular culture. 
However, it is rather universal to consider certain actions as negative and
positive, or as evidence of likes and dislikes. A negative attitude toward
another person is manifested in wanting to see bodily harm done to the
individual, wanting to be removed from his presence, wishing that he 
would suffer a financial loss, wanting him to lose positions of prominence, 
and so on. The negativeness and positiveness of attitudes are intuitive 
concepts which are difficult to define concretely. 

There are many ways in which a person can react negatively or
positively, ranging from a refusal to buy merchandise from a Jewish
businessman to joining the Ku Klux Klan. Verbal reactions are the most
widely studied attitudinal reactions. People are asked either to agree or
disagree with statements like, “The government should stop letting
Italians enter this country” and “We should do everything possible to
make Italian
immigrants feel at home in this country,” or some more complex method 
of rating is used. Verbal reactions are studied more than other kinds of 
behavior because of the ease with which they can be handled in research 
investigations. It is much simpler to have subjects react to a list of state- 
ments in a questionnaire than to observe how individuals behave in their 
daily lives. Although interesting new types of data might be found by 
studying how people actually behave, the enormous labor and expense 
which are entailed have permitted only a few investigations of this kind.
All of the instruments which will be discussed in this section concern 
what people say about their feelings. 

The object in the measurement of attitudes is to locate each person at 
some point on a continuum, or scale, which ranges from “strong negative 
attitude” through “neutral” to “strong positive attitude.” On the surface 
this seems like a relatively simple problem. Why not simply ask the
individual how he feels about an object or class of persons? This could be
done by having the individual rate, say, Negroes on a continuum ranging 
from “dislike very much” to “like very much.” Single ratings of over-all 
reactions are sometimes used in attitude studies, and although they are 
useful for making gross comparisons of people, they offer a number of 
drawbacks. 

If an individual is asked to give his over-all reaction on a single con- 
tinuum, this does not provide a reliable indication of the intensity of his 
reaction. The reliability of single-rating scales is usually too low to pro- 
vide a good estimate of attitudes. A more reliable location of the indi- 
vidual on an attitude continuum can be obtained by combining the scores 
made on a number of items. 

Another reason why devices other than the single rating scale must be 
used is that over-all expressions of likes and dislikes are often dominated 
by stereotyped reactions which do not represent true feelings. People may
rate that they are either neutral or favorable toward Negroes, yet they
might also endorse more specific statements like, “I would not like to have
my children in an ‘integrated’ school,” and “Negroes are more likely to
commit crimes than are white people.” 

An attitude scale is derived from a collection of statements all of which 
concern different degrees of positive and negative reaction to the object 
or person being studied. First it must be determined whether or not the 
statements concern only one attitude. It is possible that there are two or 
even several attitudes involved in the items, each of which should form 
a separate scale rather than be mingled in one. A statement such as “This
city should help provide housing for migrant Southern Negroes” might be 
included in a scale to measure attitudes toward Negroes. It might prove 
to be as much or more an attitude toward the use of civic funds to com- 
pete with private industry. 

After obtaining a group of items which concern only one attitude, a
method of administering and scoring the items must be determined so
that people will be located as precisely as possible along the underlying
continuum, or scale. These are the principal steps that are involved in
attitude scaling. Some of the major techniques for obtaining attitude
scales will be discussed in the following sections.

Thurstone Method of Scale Construction. One of the earliest and still 
most widely used methods of generating attitude scales was developed 
by Thurstone (see 19) and his colleagues. The outstanding feature of the 
method is the use of judges to determine the points on the attitude con- 
tinuum. The first step in constructing the scale is to gather several hun- 
dred statements which seem to express various degrees of negative and 
positive attitudes toward the class of persons or objects being studied. 
One way of obtaining statements is to have a group of people write about 
their attitudes toward the particular person or object. Other statements 
can be obtained from popular literature on the issue. The experimenters 
can usually add numerous statements from their own experiences. 

Each statement is reproduced on a separate slip of paper. Several hun- 
dred persons are then chosen as judges. Each judge is handed the entire 
collection of statements and asked to rate them on an 11-point continuum 
ranging from “extremely favorable,” through “neutral,” to “extremely 
unfavorable.” It is important to note that the judges do not rate the extent 
to which they agree with each of the statements; they judge the intensity 
of each statement. 

The next step in constructing a scale of the Thurstone type is to com- 
pute indices of variability for each item. If all of the judges place a state- 
ment in the eighth, ninth, and tenth piles, this represents good agree- 
ment about the intensity of the item. If the placements are scattered all 
up and down the 11 piles, this indicates that the statement is either 
ambiguous or perhaps belongs to some other attitudinal factor. The
standard deviation of the placements of an item by the judges would serve as
an adequate measure of variability. The index which is actually used is
one-half the distance between the 25th and 75th percentile of the judges’
ratings. Only those statements are retained for the attitude scale that have 
relatively low interjudge variability. 
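
The variability index just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the judges' pile placements are invented, the function name is hypothetical, and the rule for interpolating the 25th and 75th percentiles (the standard library's "inclusive" method) is an assumption, since the text does not say how the percentiles are computed.

```python
from statistics import quantiles

def semi_interquartile_range(placements):
    """Half the distance between the 25th and 75th percentiles of the placements."""
    # Assumption: "inclusive" interpolation; other percentile rules give slightly
    # different values but the same ordering of tight versus scattered items.
    q1, _median, q3 = quantiles(placements, n=4, method="inclusive")
    return (q3 - q1) / 2

# Hypothetical pile placements (1 to 11) by ten judges for two statements.
agreed_item = [8, 8, 9, 9, 9, 9, 9, 10, 10, 10]
scattered_item = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11]

for name, placements in [("agreed", agreed_item), ("scattered", scattered_item)]:
    print(name, semi_interquartile_range(placements))
```

A statement like the second one, with placements spread over nearly all of the piles, would be dropped; where to set the cutoff is a matter of judgment that the sketch leaves open.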

The final step in constructing scales of the Thurstone type is to select 
from the remaining statements a group that spreads evenly over the 
intensity range. For this purpose the median intensity score is determined 
for each item. If half the judges place a statement at or above the seventh
pile and half place it at or below the seventh pile, the median intensity
score is 7. As is usually the case, the median lies at some point between
two piles, and consequently medians are expressed to one decimal place,
like 7.4 and 9.3. (The arithmetic mean would serve much the same
purpose here as the median.) The scale position for each item is then
considered to be the median intensity judgment. The final scale consists of the
twenty or so items which spread most evenly over the intensity range.
Ideally, the items should have median intensity judgments respectively 
of 0, 0.5, 1.0, 1.5, and so on to the top of the scale. 
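
The two operations in this step, finding a median scale value for each statement and then choosing statements that spread evenly over the range, might be sketched as follows. The function names and data are hypothetical. The use of `statistics.median_grouped` to interpolate a median within a pile is an assumption about how the one-decimal medians arise, and the "nearest unused statement" rule is only one simple way to approximate an even spread.

```python
from statistics import median_grouped

def scale_value(placements):
    """Median pile placement, interpolated within the pile, to one decimal place."""
    return round(median_grouped(placements, interval=1), 1)

def pick_evenly_spread(scale_values, targets):
    """For each target position, take the unused statement with the nearest scale value."""
    remaining = dict(scale_values)
    chosen = []
    for target in targets:
        if not remaining:
            break
        best = min(remaining, key=lambda s: abs(remaining[s] - target))
        chosen.append(best)
        del remaining[best]
    return chosen

# Hypothetical scale values for five statements that passed the variability screen.
values = {"a": 0.2, "b": 1.0, "c": 1.1, "d": 2.2, "e": 3.1}
print(pick_evenly_spread(values, targets=[0, 1, 2, 3]))

# Hypothetical placements by six judges for one statement.
print(scale_value([6, 7, 7, 7, 8, 8]))
```

Statement "c" is passed over because "b" already covers its region of the continuum, which is the sense in which the final items "spread most evenly."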

The final attitude scale is composed by randomly ordering the state- 
ments on a printed form. The subject is instructed to mark those state- 
ments with which he agrees. The score is the median intensity of the 
statements which are marked. If one person agrees with statements that 
have intensity indices of 9.5, 10.0, and 10.5, his score would be 10. In a 
perfect scale the individual would mark only one statement or several 
statements with similar intensity scores. However, in practical work with 
the scales, subjects vary somewhat from this expectation. It would not be 
uncommon to find a person who agrees with statements that have intensi- 
ties of 10, 9, 8.5, and then agrees with one of 5 also. 
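
The scoring rule can be checked against the two cases mentioned above. The following Python sketch uses a hypothetical function name; the scale values are the ones given in the text.

```python
from statistics import median

def thurstone_score(endorsed_scale_values):
    """Score a respondent as the median scale value of the statements he marked."""
    if not endorsed_scale_values:
        raise ValueError("respondent marked no statements")
    return median(endorsed_scale_values)

# The consistent respondent from the text.
print(thurstone_score([9.5, 10.0, 10.5]))

# The less consistent respondent who also agrees with a statement scaled at 5.
print(thurstone_score([10, 9, 8.5, 5]))
```

Because the median is used, the one stray endorsement at 5 pulls the second respondent's score down only to 8.75 rather than to the arithmetic mean of 8.125.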

The scale of the Thurstone type is an economical, common-sense type 
of instrument which has been used widely during the last twenty-five 
years. Special scales have been constructed for attitudes toward war, 
Negroes, censorship, patriotism, capital punishment, Chinese, and numer- 
ous others. Some of the items on the Scale for Measuring Attitudes toward 
the Church (19) are shown in Table 13-7. 

The use of judges to estimate the intensity of attitudes is both a strong 
and weak point of the Thurstone scale. It is an advantage in that the 
scale positions have a rational meaning that would be difficult to obtain 
by any other method. If the judges are in good agreement that a respond- 
ent who marks a particular statement has a positive or negative attitude 
to a certain degree, this tells us something directly about the respondent. 
The use of judges is particularly helpful in locating the neutral point on 
the attitude scale. Attitudes are inherently bipolar, ranging in opposite 
directions from a hypothetical neutral point. In campaigns to change
particular attitudes, it is very important to locate individuals in respect to
the neutral point. It is much easier to bring a positive change in an
individual whose attitude is slightly positive rather than slightly negative.
Sometimes it is more important to know precisely which side of the
neutral point the individual is on rather than to know the exact distance from
the neutral point in either direction.


TABLE 13-7. ILLUSTRATIVE ITEMS AND SCALE VALUES FOR A THURSTONE TYPE
MEASURE OF ATTITUDES TOWARD THE CHURCH

(Adapted from Thurstone and Chave, 19, pp. 60-63)

Scale value   Item
    0.2       I believe the church is the greatest institution in America today.
    1.0       I believe that the church has grown up with the primary purpose of
              perpetuating the spirit and teachings of Jesus and deserves loyal support.
    2.2       I like to go to church, for I get something worthwhile to think about and
              it keeps my mind filled with right thoughts.
    3.1       I do not understand the dogmas or creeds of the church but I find the
              church helps me to be more honest and creditable.
    4.0       When I go to church I enjoy a fine ritual service with good music.
    5.1       I like the ceremonies of my church but do not miss them much when
              I stay away.
    6.1       I feel the need for religion but do not find what I want in any one
              church.
    7.2       I believe that churches are too much divided by factions and
              denominations to be a strong force for righteousness.
    8.2       The paternal and benevolent attitude of the church is quite distasteful
              to me.
    9.1       My experience is that the church is hopelessly out of date.
   10.7       I think the organized church is an enemy of science and truth.



The Thurstone method of scaling assumes that intensity judgments are 
independent of the judges’ own attitudes. It is conceivable that this is the
case, that, say, two individuals who have different attitudes toward
Negroes agree on the intensity of statements about Negroes. If judgments 
of intensity are not independent of attitudes, the scale positions are very 
much a function of the judges who are chosen. This would change all of 
the scale positions and would shift the neutral point as well. What would 
be scored a positive attitude on a scale obtained from one group of 
judges might be scored as a negative attitude on a scale obtained from a 
different group of judges. 

A number of studies (6, 8, 9) report comparisons of scales obtained 
from different kinds of judges. Hinckley (8) employed a group of South- 
ern white students who were prejudiced against Negroes to rate the 
intensity of 114 statements. The scale positions obtained were compared
with those for a group of Northern white students who were favorable
toward Negroes. The two scales correlate .98, showing a very close agree- 
ment in the two relative orderings of the items. This and other studies 
show that at least the relative position of statements on the scale can often 
be determined without considering the attitudes of the judges themselves. 
However, the studies show that there are often significant differences in 
the absolute intensity of statements when judged by groups that hold dif- 
ferent attitudes. (It must be remembered that the correlation coefficient 
is a measure of covariation, not a measure of agreement.) A prejudiced 
group and a nonprejudiced group may agree that one statement is more 
positive than another, but they may differ in their judgments of how posi- 
tive the statements are. When the scale values are dependent in part on 
the particular group of judges which is used, the neutral point on the 
scale cannot be determined precisely. 
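
The distinction between covariation and agreement can be made concrete with a small Python sketch. The scale values below are invented for illustration and are not taken from Hinckley's study; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scale values for the same five statements from two groups of judges.
group_one = [1.0, 3.0, 5.0, 7.0, 9.0]
group_two = [3.0, 5.0, 7.0, 9.0, 11.0]   # every statement placed two piles higher

r = pearson_r(group_one, group_two)
shift = mean(group_two) - mean(group_one)
print(round(r, 2), shift)
```

The correlation is perfect although no statement receives the same scale value from the two groups, and a statement at the neutral pile for one group lies two piles from neutral for the other.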

A potentially weak point in scales of the Thurstone type is the absence 
of any direct procedure for determining whether or not there is only one 
attitude involved in the statements. Statements that relate prominently 
to attitudes other than the one being studied would probably have high 
interjudge variability and would be removed from the scale for that rea- 
son. However, this is not a precise method for purifying the scale. As is 
sometimes done, a scale of the Thurstone type can be purified of extrane- 
ous factors by retaining only those items that show a substantial degree 
of homogeneity with the scale as a whole (see 19 for the procedure which 
is used). 

Likert Method of Scale Construction. The Likert method (see 11) also 
starts with the collection of a large number of positive and negative state- 
ments about an object, institution, or class of persons. Unlike the 
Thurstone approach, judges are not employed in the Likert method. 
Instead, the scale is derived by item-analysis techniques. The collection 
of items is administered to a group of subjects. Each item is rated on a 
five-point continuum ranging from “strongly approve” to “strongly dis- 
approve.” Then each item is correlated with total score, which shows the 
extent to which the item measures the same general underlying attitude 
as the total set of items. Items which have low correlations with total 
score are either unreliable or measure some extraneous attitude factor. 
Only those items which have high correlations with total score are retained 
for the attitude scale. The Likert scaling procedure helps ensure that the 
final scale concerns only one general attitude and that individuals can be 
located with at least moderate precision at different points on the scale. 
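
A minimal Python sketch of the item-analysis step follows. The responses are invented, the function names are hypothetical, and the cutoff of .50 for a "high" correlation is an assumption made only for the example; each item is correlated with the total score that includes the item itself, as the text describes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def item_total_correlations(responses):
    """Correlate each item (column) with the total score of each respondent (row)."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson_r([row[i] for row in responses], totals) for i in range(n_items)]

def retained_items(responses, minimum_r=0.5):
    """Indices of items whose item-total correlation reaches the assumed cutoff."""
    correlations = item_total_correlations(responses)
    return [i for i, r in enumerate(correlations) if r >= minimum_r]

# Hypothetical ratings (1 to 5) by six respondents on four items.
responses = [
    [5, 5, 4, 3],
    [4, 4, 5, 1],
    [3, 3, 3, 5],
    [2, 2, 2, 2],
    [1, 1, 2, 4],
    [4, 5, 4, 3],
]
print([round(r, 2) for r in item_total_correlations(responses)])
print(retained_items(responses))
```

The fourth item is unrelated to the total and would be discarded, while the first three hang together and would form the scale.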

Comparing the Likert and Thurstone methods, the Likert approach is
more empirical because it deals directly with respondents’ scores rather
than employing judges. The Likert method more directly determines
whether or not only one attitude is involved in the original collection of
items, and the scale which is derived measures the most prominent
attitudinal factor which is present. The use of a five-point scale for each item
provides more information than the simple dichotomy of “agree” or
“disagree.” The only place in which the Thurstone method might be superior
is the direct meaningfulness of scale scores. No absolute meaning can
be given to the points on the Likert scale. The scores are relative to the
group from which the scale is constructed. It would be helpful to have
more general norms, to at least be able to say that an individual is above
or below average by so much, but large-scale norms are seldom available.
However, as was pointed out above, the meaning of scores on the
Thurstone scale is sometimes dependent on the judges who are used.

On the final scale of the Likert type, the subject marks each item
in one of the categories of strongly agree (SA), agree (A), undecided
(U), disagree (D), and strongly disagree (SD). If the subject marks
“strongly agree” on a statement, a score of 5 is given, “agree” is
given a score of 4, and so on to a score of 1 for “strongly disagree.” The
scoring system is reversed for negative items. A “strongly agree” is
given a score of 1 and so on to 5 for a “strongly disagree” response. The
final score is obtained by summing the item scores. Illustrative
items from one of the scales are shown in Table 13-8.
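
The scoring rule, including the reversal for negative items, can be sketched in Python. The function name and the respondent's answers are hypothetical, and the choice of which statements to treat as negatively worded is an assumption made for the example rather than the published scoring key.

```python
POSITIVE_KEY = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}

def likert_score(answers, negative_items):
    """Sum the item scores, reversing the key for negatively worded items."""
    total = 0
    for item, answer in answers.items():
        points = POSITIVE_KEY[answer]
        if item in negative_items:
            points = 6 - points   # SA becomes 1, A becomes 2, ..., SD becomes 5
        total += points
    return total

# Hypothetical answers from one respondent.
answers = {
    "The future looks very black.": "D",
    "Times are getting better.": "A",
    "It is difficult to think clearly these days.": "SD",
}
# Assumption: these two statements are keyed as negative.
negative = {
    "The future looks very black.",
    "It is difficult to think clearly these days.",
}
print(likert_score(answers, negative))
```

With the reversal applied, a high total always means a favorable attitude (here, high morale), whichever way the individual statements happen to be worded.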


TABLE 13-8. ILLUSTRATIVE ITEMS FROM A LIKERT TYPE SCALE FOR THE
MEASUREMENT OF CIVILIAN MORALE DURING THE DEPRESSION

(Adapted from Rundquist and Sletto, 12)

The future looks very black.                    SA  A  U  D  SD
Times are getting better.                       SA  A  U  D  SD
No one cares much what happens to you.          SA  A  U  D  SD
Real friends are as easy to find as ever.       SA  A  U  D  SD
It is difficult to think clearly these days.    SA  A  U  D  SD
[...]                                           SA  A  U  D  SD
[...] living in these times.                    SA  A  U  D  SD
[...] people can be trusted.                    SA  A  U  D  SD


The Bogardus social-distance scale is a technique for measuring attitudes
toward different groups. Rather than being a scaling procedure, such as the
Thurstone and Likert methods, the Bogardus instrument is identified by a
novel [...]

1 To close kinship by marriage
2 To my club as personal chums
3 To my street as neighbors
[...]


Very little research has been done to determine correlations between
positions on the social-distance scale and those on more conventional
attitude scales [...]

[...] is both an interesting and a complex concept. [...] fails to cover
extreme attitudes. There are [...] degrees of dislike beyond wanting to
have no social contact with a group. Investigators have often found it
[...] to employ several conventional attitude items along with the [...]
to tap strong negative feelings. This is illustrated in the scale developed by
Crespi (5) for measuring attitudes toward “conscientious objectors”:
1 I would treat a conscientious objector no differently than I would
  any other person, even so far as having him become a close
  relative in marriage.
2 I would accept conscientious objectors only in so far as having
  them for friends.
3 I would accept conscientious objectors only so far as having them
  for speaking acquaintances.
4 I don’t want anything to do with conscientious objectors.
5 I feel that conscientious objectors should be imprisoned.
6 I feel that conscientious objectors should be shot as traitors.

It was necessary to add statements 5 and 6 to get at strong negative
feelings.


The Bogardus scale is much easier to construct than the Thurstone or
the Likert instruments. It applies only to classes of people; it could not
be used, for example, to measure attitudes toward labor unions. The same
categories or a slight variant of them can be applied to whatever group
is being studied. One of the interesting properties of the social-distance
scale is that it concerns actions rather than more general feelings about
people. Although the research evidence on this point is very small, the
Bogardus social-distance scale might perform better than conventional
attitude measures in predicting how two groups of people will behave
when brought together.

The Guttman Method of Scale Construction. An interesting new 
approach to attitude scaling is the procedure developed by Guttman (see 
7, 14) in connection with studies of the morale of the American soldier 
during the Second World War. The objective of the method is to test
directly whether a collection of items can be scaled on one attitude
continuum. The criterion of scalability is that if an individual endorses a
more extreme item he should endorse all less extreme items also. This
scaling criterion is applied to the scores obtained from a tryout group of
respondents. If a single scale underlies all of the items, the items can be
arranged to form a triangular pattern like that in Figure 13-1.

FIGURE 13-1. Response pattern for nine persons on nine items for a perfect Guttman
scale. A mark of 1 means that the person agrees with the statement, and a mark of 0
means that he does not agree with the statement. (Persons A through I are shown
against items 1 through 9.)

Figure 13-1 greatly simplifies the Guttman scale in order to illustrate
the principle that is at work. Usually there are both many more items and
many more people being studied. Also, Figure 13-1 is composed in such
a manner that only one person has each of the different response patterns.
Item 1 is the most extreme attitude, let us say extremely negative in this
case. Only person A agrees with item 1, and as is true in a perfect scale,
person A agrees with the second most intense statement, 2, the third most
intense, 3, and so on to the least intense, but still negative, item 9. Person B
holds the second most intensely negative attitude, and so he endorses
item 2 and all of those beneath it in the intensity hierarchy. Person I holds
the least intensely negative attitude, agreeing to only the least intensely
negative item, 9.
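
The triangular pattern described for Figure 13-1, and a test of whether a set of responses fits it exactly, can be sketched in Python. The function names are hypothetical, and the check shown is one straightforward reading of the scalability criterion (each person endorses exactly the most widely endorsed items, up to the number he endorses), not necessarily the computing routine used by Guttman.

```python
def perfect_pattern(n):
    """Person p (0 = A) agrees with item i (0 = item 1) whenever i >= p."""
    return [[1 if item >= person else 0 for item in range(n)] for person in range(n)]

def is_perfect_scale(responses):
    """True if every person endorses exactly his k most widely endorsed items."""
    n_items = len(responses[0])
    popularity = [sum(row[i] for row in responses) for i in range(n_items)]
    # Items ordered from most widely endorsed (least extreme) to least endorsed.
    order = sorted(range(n_items), key=lambda i: popularity[i], reverse=True)
    for row in responses:
        k = sum(row)
        expected = set(order[:k])
        marked = {i for i in range(n_items) if row[i] == 1}
        if marked != expected:
            return False
    return True

figure = perfect_pattern(9)
for label, row in zip("ABCDEFGHI", figure):
    print(label, *row)
print(is_perfect_scale(figure))

# One inconsistent answer is enough to violate the strict criterion.
broken = [row[:] for row in figure]
broken[0][4] = 0   # person A now fails to endorse item 5
print(is_perfect_scale(broken))
```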

The response pattern found in the perfect Guttman scale is exactly
what is obtained if people are rank-ordered on a physical continuum.
Suppose, for example, that we ask people questions about their height,
and assume that they know how tall they are. The person who answers
“yes” to the question “Are you above 6½ feet tall?” will answer “yes” to
“Are you above 6 feet tall?” and “yes” to all questions about heights down
to zero. The person who answers “no” to the question “Are you above
6½ feet tall?” but does answer “yes” to the question “Are you
above 6 feet tall?” will answer “yes” to questions about all lower heights.
Similarly with persons of all heights, by knowing the most extreme state- 
ment that a person endorses, his other responses can be predicted 
perfectly. 
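
The height example can be written out as a short Python sketch. The thresholds, the respondent's height, and the function names are hypothetical; the point is only that the whole answer pattern can be rebuilt from the most extreme question answered "yes."

```python
THRESHOLDS = [6.5, 6.0, 5.5, 5.0]   # "Are you above this many feet tall?"

def answers_for(height):
    """Answers (True = yes) that a person of the given height gives to each question."""
    return [height > t for t in THRESHOLDS]

def predict_from_most_extreme(most_extreme_yes):
    """Rebuild the whole answer pattern from the tallest threshold answered yes."""
    return [t <= most_extreme_yes for t in THRESHOLDS]

height = 6.2   # hypothetical respondent
actual = answers_for(height)
most_extreme = max(t for t, yes in zip(THRESHOLDS, actual) if yes)
print(actual)
print(actual == predict_from_most_extreme(most_extreme))
```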

In the example above, we started by knowing that only one scale, 
height, was involved. The purpose of the Guttman procedure is to test 
whether or not a collection of attitude statements will exhibit the charac- 
teristic pattern. It is an elegant notion; and if a set of items can be found 
which will fit this pattern, it is convincing evidence of scalability. 

Unfortunately there are some important practical drawbacks to the use 
of the Guttman scaling procedure. The principal one is that almost no 
collection of statements will meet the strict lawfulness of the scalability 
criterion. The criterion insists that each separate item be almost com- 
pletely reliable; but in practical work individual items are notoriously 
heavy with measurement error. Suggestions have been made for the use 
of approximately scalable collections of items, but even an approximation 
of scalability in the Guttman sense is rare to find. One solution to the 
problem is to search around through the items to find a set which will 
approximately meet the criterion of scalability. This takes considerable 
advantage of chance, and it will be expected that in many cases the scal- 
ing will not hold up when the items are administered to new groups of 
subjects. 

In the few cases in which the Guttman criterion of scalability has been 
approximately met, the items prove to be so closely related in content as
to constitute near rewordings of the same statement (see 14). There is
little reason to believe that collections of items such as are handled
by the Likert procedure will scale in the Guttman sense.

Although the procedures of scaling identified with Guttman are not
suitable for most practical work, the concept of what constitutes a
unidimensional scale is appealing and deserves more attention. The
important thing to note about this concept of scalability is that it is defined
entirely in terms of rank-ordering of persons and items. The basic data 
with which the method begins is the set of qualitative responses (cate- 
gories), either “agree” or “disagree” on each item, given by the respond- 
ents. The example concerning the heights of people illustrates how quali- 
tative responses can define a rank-ordering of people on an underlying 
scale. Most of the procedures for deriving and validating tests and atti- 
tude measurement devices assume that interval scales underlie the re- 
sponses. This is so because they employ means, standard deviations, 
correlations, and other statistics which make sense only when an interval 
scale is assumed for the measures being studied. 

Fewer assumptions about psychological data would need to be made 
if statistical methods that assumed only rank-order scales could be used. 
The great difficulty is that when the interval scale is abandoned in favor 
of rank-order and categories, most of the powerful statistical methods 
such as multiple correlation and factor analysis must be abandoned also. 
This is squarely the problem that is encountered in using the Guttman 
criterion of scalability. When the criterion is not met perfectly, and it is 
highly unlikely that it ever will be, there is little that can be done statis- 
tically to determine the number and kinds of scales which underlie the 
responses. What is needed is a method for analyzing the responses to a 
diverse collection of attitudinal statements which will determine the num- 
ber and kinds of rank-order scales which are present in the data. No ade- 
quate solutions to this problem have yet been developed. Until they are, 
Guttman’s concept of scalability and the procedures it entails are not 
likely to be of much practical importance in the construction of attitude 
scales.

Factored Scales. One of the most direct approaches to determining the 
number and kinds of scales involved in a collection of items is to factor 
analyze responses. All of the conventional methods of factor analysis can 
be used on attitude items in the same way that they are applied to other 
sets of psychological measures. Desirably, the items should be rated on a
continuum, using, say, five or more points, rather than obtaining only an
“agree” or “disagree” response. After the collection of items is
administered to a group of respondents, preferably with at least several hundred
persons included, all intercorrelations among the items are obtained. The 
intercorrelations are then factor analyzed. Each of the major factors con- 
stitutes a separate attitude scale. The items which relate most promi- 
nently to a factor can be used to construct the scale. 

As was stated above, the use of factor analysis to derive attitude scales 
assumes that interval scales underlie the responses. This is an assumption
that is quite generally made in statistical work with psychological
measures. The tests and attitude scales which have been derived in this way
work well in practice. Factor analysis would have been used more broadly 
in the derivation of attitude scales were it not for the statistical labors that 
are involved. Now that high-speed computational equipment is becoming 
more available, large-scale factor analysis of attitude items will probably 
become more common. 

Methods of attitude scaling like Thurstone’s and Likert’s have been 
successful without factor analysis largely because they have dealt with 
collections of statements which are dominated by one large attitudinal 
factor. Although both methods tend to eliminate items that belong to 
small extraneous factors, neither method will work well if applied to 
collections of items in which several prominent factors are present. Factor 
analysis is more necessary in deriving scales in rather subtle attitude 
domains, like attitudes of employees toward working conditions in a 
particular factory and attitudes of people toward government spending. 
In terms of employee attitudes, it is likely that a factor would be found 
of over-all negative and positive attitude toward the work; but it would 
be expected also to find separable factors relating to different aspects of 
the work, such as danger of working conditions, adequacy of tools and 
equipment, and pleasantness of surroundings. 

The Interview. A not uncommon criticism of attitude scales is that they 
restrict the individual’s reactions to a relatively narrow range of content 
and prevent the individual from expressing his own attitudes in his own 
words. If the attitude items have all been constructed in the narrow con- 
fines of the psychological laboratory without recourse to what people 
actually think and feel, it is possible that few pertinent statements will 
enter the final instrument.

Some people prefer an open interview situation to the use of attitude 
scales. The respondent is asked a series of “open-end” questions like,
“What do you think about school ‘integration’?” The respondent is then
allowed to say whatever comes to mind, perhaps with some prodding if
he gets too far off the track. The responses are either written down
verbatim or pertinent parts of the responses are noted.

The advantage of the interview is that it permits a direct inquiry into 
attitudes without assuming in advance what specific questions and alter- 
native answers should be used. The responses will not only teach the 
investigator something about the issues which loom most largely in the
public mind but will also provide some suggestions as to why people 
feel as they do. 

There are several difficulties in using the interview to study attitudes.
In the face-to-face interview situation there is usually more pressure to
give socially acceptable responses, particularly those which the
respondent thinks will please the interviewer, than is the case in the anonymous
response to an attitude scale. In many studies, the interviewer must exer- 
cise his own judgment to determine how the respondent feels about an 
issue or why the respondent answers as he does. The subjectivity which 
this entails introduces unreliability into the results of the study. The 
major difficulty with the open interview is that it is very hard to find any
systematic way of scoring and summarizing the answers that are given.
The results of the study are often highly dependent on the experimenter’s 
intuition as to what the dominant trends in the data are. 

A procedure which is used in many investigations is to conduct
interviews preparatory to developing more standardized measures of attitudes.
Fifty to a hundred persons varying in age and socioeconomic status are 
interviewed about their attitudes. This provides a rich source of attitudinal 
statements as well as suggestions about the methods of measurement 
which will be most effective in subsequent studies. 


CHARACTERISTICS OF ATTITUDES 


Reliability. Well-constructed attitude scales are usually as reliable as 
most aptitude tests are. As is true for all kinds of measures, the reliability 
is directly dependent on the number of items in the scale and the amount
of correlation among the items. When short scales are employed contain- 
ing no more than five or so items, the scores will often not be sufficiently 
reliable to make predictions about individual respondents. However, even 
a short and relatively unreliable scale will serve to differentiate the atti- 
tudes of whole groups of persons, such as to differentiate the attitudes of 
employees and management about labor unions. 
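
The dependence of reliability on scale length and on the correlation among items can be illustrated with the general Spearman-Brown relation, in which the reliability of a k-item sum is k times the average inter-item correlation, divided by 1 plus (k - 1) times that correlation. The following is only a minimal sketch in Python; the function name and the average inter-item correlation of .20 are assumptions chosen for illustration, not figures taken from any study cited in this chapter.

```python
def scale_reliability(n_items: int, mean_item_r: float) -> float:
    """Estimate the reliability of a sum score from n_items items whose
    average intercorrelation is mean_item_r (general Spearman-Brown form)."""
    if n_items < 1:
        raise ValueError("a scale needs at least one item")
    return n_items * mean_item_r / (1 + (n_items - 1) * mean_item_r)


# Assumed average inter-item correlation of .20 (illustrative only).
for k in (5, 20, 40):
    print(k, round(scale_reliability(k, 0.20), 2))
```

Under these assumed values a five-item scale reaches a reliability of only about .56, whereas a forty-item scale reaches about .91, which is in keeping with the statement that very short scales are seldom reliable enough for predictions about individuals.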

Validity. Earlier in this chapter it was said that nearly all of the meas- 
ures of attitudes depend on verbal reactions rather than on observations 
of behavior. If an attitude scale is to be considered an assessment of 
expressed feelings, it should have content representativeness. Efforts to 
achieve content representativeness are made in most of the scaling proce- 
dures by starting with a broad collection of the things that people say 
about an issue. As recommended earlier, conducting some open interviews before
constructing the scale helps ensure content representativeness for the
items.

Many of the currently available attitude scales do a sufficiently good 
job of sampling verbal reactions to be considered reasonably valid measures
of expressed attitudes. The problem in validating attitude scales is
that scores are often given a wide significance beyond verbal reactions. 

If the scores on attitude scales are used to predict how individuals will 
behave in their daily lives, there is no substitute for empirical studies to 
determine predictive validity. There are vast practical difficulties involved 
follow from attitude scales. Three employees who show the same amount
of negative attitude toward a company may do very different things in 
respect to their expressed attitudes. One might make a speech in a union 
meeting about grievances toward the company, another might do shoddy
work on purpose, and the third employee might do nothing except com- 
plain to his wife about the unfairness of the company. 

Some confidence in the use of an attitude scale is obtained if it can be 
shown to relate prominently to at least some other kinds of relevant 
behavior. For example, Telford (18) compared the average scores on a 
scale measuring attitude toward the church with reports of church
attendance (a low score indicates positive attitude):


Frequency of church      Average score on
attendance               attitude scale

Regularly                1.91
Frequently               2.48
Occasionally             3.50
Seldom                   4.95
Never                    6.75


Sims and Patrick (13) applied a scale of the Thurstone type for measur- 
ing attitudes toward Negroes to different groups of college students, with 
the following results (a high score indicates a positive attitude): 


Group                                     Average score

Northern students                         6.7
Northern students living in the South     5.9
Southern students                         5.0


Depth of Attitudes. Psychoanalysis and other forms of depth psychology 
have taught us that what the individual says about his attitudes may be 
quite different from attitudes which he expresses in other ways. The indi- 
vidual may either consciously cover up socially unacceptable attitudes, or 
he may even fool himself and be unaware that there is a conflict between 
what he says and what he does. Some investigators place little faith in 
attitude scales, claiming that they measure only superficial attitudes and 
not “real” attitudes. The difficulty is that there are no accepted procedures 
for measuring “real” or “deep” attitudes. Projective devices have some- 
times been employed for this purpose. A typical procedure for studying 
prejudice is to show the respondent a picture of a Negro talking to a
white person. The respondent is asked to guess what they are talking
about. If, for example, the respondent says that the white man is asking
back some money which the Negro stole, this could, without too much
strain on the imagination, be considered evidence of a negative attitude
on the part of the respondent. Procedures of this kind are interesting and 
might eventually lead to new tools for the measurement of attitudes.
However, the projective devices which are presently available depend 
heavily on the subjective judgment of the interviewer, and there are 
numerous questions about the validity of the instruments. They are more 
suited to laboratory studies of small groups of people than they are to 
general use with diverse segments of the general population. (The pro- 
jective techniques will be discussed in detail in Chapter 15.) 

It should not be assumed that expressed attitudes are less important 
than other kinds of negative and positive reactions. What people say 
publicly is often more influential in determining the course of events 
than whatever they may “really” feel. Expressed attitudes are often 
derived from conflicts within the individual. A person may have learned 
in his early years that all foreigners are to be distrusted. Later experience 
and education could instill in him an attitude of humanitarianism and 
concern for people of all nations. The two attitudes sit side by side in 
the individual, both of them genuine feelings. If the individual publicly 
espouses the doctrine that we should do everything possible to help the 
downtrodden throughout the world, this does not necessarily mean that 
the expressed attitude is false or unimportant. 

It is necessary for clinical psychologists and psychiatrists to delve 
within the individual to learn the masking of one attitude with another 
and the conflicting feelings which lead to personal disturbance. However, 
for the sake of measuring public attitudes, it is usually more important to 
learn the feelings that people will express and stand on publicly. These 
both reflect and determine the public image, the common viewpoint that 
a group or a nation holds as a basis for action.

The Meaning of Attitudes to the Individual. Two people may show 
much the same response on an attitude scale and differ importantly in 
other ways. One variable to consider is the relative abstractness versus 
concreteness of the attitudinal reaction. Some attitudes are abstract in the
sense that they are formed with only a small amount of contact with the 
object, person, or institution concerned. Other attitudes are formed from 
considerable first-hand experience. As an example of an abstract reaction, 
up to ten or so years ago Turks were given markedly unfavorable ratings 
on attitude scales. This was so in spite of the fact that few of the respond- 
ents had any first-hand experience with Turks. This was evidently a his- 
torical holdover from the days when the then warlike Turks were in 
conflict with the Christian world. It is not uncommon to hear young
children voicing the prejudices of their parents toward race and religion.
Such attitudes are largely second hand, based on little concrete
experience.

Sometimes only very slight cues are used for the formation of attitudes.
For example, rank your preferences for the following groups of people, 
rating the most preferred group 1 and the least preferred 4: 


Sylvanians
Alusians
Blugdugians
Polurasians


The majority of the people will rate the Blugdugians as the least pre- 
ferred group—this is so in spite of the fact that these are all imaginary 
groups of people. The Blugdugians are rated low because of the unmel- 
odious sound of their name. 

There is a real danger that some of the attitude studies measure only 
incidental abstract aspects of the issues. This is particularly likely in 
measuring attitudes toward institutions and associations, such as General 
Motors, the Americans for Democratic Action, and Harvard University. 
Something can be learned about the concreteness of attitudes by employ- 
ing a standard list of questions concerning the respondents’ knowledge 
and experience with the issue. Although there is no proof of it, the expec- 
tation is that highly abstract attitudes change more readily than ones 
based on considerable concrete experience. 

Another important characteristic of an attitude is the strength with 
which it is held. Some people feel more strongly about an issue than other 
people do. A common practice in the study of attitudes is to employ a 
strength of feeling question along with each attitude statement. That is, 
after the respondent registers his relative agreement or disagreement with 
a statement like, “All public schools should be ‘integrated’ as soon as pos- 
sible,” he is asked to rate the strength of his feeling in one of the cate- 
gories “very strongly,” “fairly strongly,” or “don’t care much one way or 
the other.” Strength of feeling corresponds closely with the negative or 
positive intensity of the attitude. People who express either strong nega- 
tive or strong positive attitudes usually say that they feel strongly about 
the issue. People who hold nearly neutral attitudes tend to say they “don't
care much one way or the other.” Although it is conceivable that a person 
could feel strongly about a neutral attitude, this is seldom found. There 
is some evidence to indicate that intensely negative attitudes are held 
more strongly than intensely positive attitudes. 


INTEREST INVENTORIES 


Interests were defined earlier in the chapter as stated preferences for
activities. As the word “stated” emphasizes, interest inventories depend
on the individual’s honest and accurate reporting of what he likes to do.
At the outset some justification needs to be given for using interest inventories.
The common-sense approach to learning about interests would be
simply to ask the individual what occupations he prefers. If the individual
already knows that he wants to be a physician, sea captain, or fireman, it
would be a waste of time to have him record his preferences on a printed
form. The purpose in administering tests is to gain some new information
about people. 

There is a considerable amount of evidence to show that stated prefer- 
ences for occupations are unrealistic. This is particularly so among adoles- 
cents and young adults, with whom interest inventories are most needed. 
Young people are usually quite unaware of the specific activities which 
are entailed in different occupations. The individual’s stated preferences
for occupations are often prompted by glamorized stereotypes. The physi- 
cian is remembered as the heroic figure who performs the miraculous 
operation while the gallery looks down in silent awe. The sea captain is 
seen holding steadfast to the helm against the stormy onslaught of the sea. 
The mental picture of the fireman has him descending the ladder with the 
rescued maiden on his shoulder. All of these images are of course very 
unrealistic. Few physicians do surgery at all. They must spend many hours 
in unheroic activities such as reading medical texts, writing reports, and 
calming the fears of anxious patients. The sea captain has scant opportun- 
ity to steer the ship because of the modern electronic gadgetry which 
automatically navigates. The captain is usually a sea-going businessman, 
ambassador to passengers and clients, who must be concerned with such 
matters as bookkeeping, personnel management, and correspondence. No 
one considers what the fireman does in the larger portion of his time:
tending equipment, collecting funds for charities, and helping rescue cats
from inaccessible perches. 

The purpose of the interest inventory is to ask the individual about his 
preferences for a wide range of relatively specific activities such as mend- 
ing a clock, preparing written reports, and talking to groups of people. 
From these a diagnosis is made of the occupations which most closely 
match the interests of the individual. A fundamental assumption in the use 
of interest inventories is that people in different occupations have at least 
partially different interests. Otherwise there would be no way in which 
interest tests could be used successfully to advise people to consider one 
occupation rather than another.

The Strong Interest Inventory. One of the earliest and still most widely 
used measures of interests is the Vocational Interest Blank (VIB) devel- 
oped by E. K. Strong (17). Separate forms are available for men and 
women. The VIB employs 400 questions about relatively specific activities. 
On most of the items the subject indicates his preferences by marking one 
of the three categories “like,” “indifferent,” “dislike.” Some illustrative
items from the men’s form are as follows:


Buying merchandise for a store      L  I  D
Adjusting a carburetor              L  I  D
Interviewing men for a job          L  I  D

Responses to the VIB can be scored in terms of 45 occupations on the 
men’s forms and 25 occupations on the women’s forms. A separate scoring 
key is available for each occupation. The scoring keys were developed 
from the responses to the VIB made by successful persons in each of the 
occupational groups. Each scoring key is composed in such a way as to 
differentiate the people in a particular profession from people in general. 
This procedure for developing scoring keys is referred to as criterion keying.
Each scoring key consists of a set of weights to be applied to the item
responses. The weights range from +4 to -4. A positive weight means
that people in the profession, in accounting, say, mark “like” to the item
more frequently than do people in general. A negative weight means that
accountants mark the “like” category less frequently than people in general.
The larger the difference between the profession and people in
general, the larger is the weight. If an item does not differentiate a profession
from people in general, it receives a zero weight. A considerable
amount of research work was required to obtain the occupational keys,
and scoring keys for new occupations are gradually being developed.
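
The logic of criterion keying can be sketched in a few lines of Python. The sketch below is an illustration only and is not Strong's actual weighting procedure: the rule that each difference of 10 percentage points earns one unit of weight, the function names, and all of the proportions are assumptions made up for the example. Only the "like" category is keyed, following the simplified description given above.

```python
def item_weight(p_occupation: float, p_general: float, max_weight: int = 4) -> int:
    """Turn the difference in the proportion marking 'like' into an integer
    weight between -max_weight and +max_weight.
    Assumed rule (hypothetical): one weight unit per .10 difference."""
    raw = round((p_occupation - p_general) * 10)
    return max(-max_weight, min(max_weight, raw))


def build_key(like_occupation: dict, like_general: dict) -> dict:
    """Build a scoring key: one weight per item for a single occupation."""
    return {item: item_weight(like_occupation[item], like_general[item])
            for item in like_occupation}


def score(responses: dict, key: dict) -> int:
    """Sum the weights of the items the respondent marked 'L' (like)."""
    return sum(w for item, w in key.items() if responses.get(item) == "L")


# Hypothetical proportions marking 'like' (invented for illustration).
accountants = {"Buying merchandise for a store": 0.55,
               "Adjusting a carburetor": 0.15,
               "Interviewing men for a job": 0.50}
people_in_general = {"Buying merchandise for a store": 0.35,
                     "Adjusting a carburetor": 0.45,
                     "Interviewing men for a job": 0.50}

key = build_key(accountants, people_in_general)
respondent = {"Buying merchandise for a store": "L",
              "Adjusting a carburetor": "D",
              "Interviewing men for a job": "L"}
total = score(respondent, key)
print(key, total)
```

An item on which the occupational group and people in general do not differ receives a weight of zero and so contributes nothing to the score, whichever way the respondent marks it.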

The responses of an individual are scored on either some or all of the 
professions. The scores can be converted to standard scores, percentiles, 
or to a grading system ranging from A to C. The resulting profile of scores 
is used to interpret the individual's interests. People usually express high 
interest in a number of related professions such as mathematician, engi- 
neer, and chemist. 

The Kuder Interest Inventory. The Kuder Preference Record (10) and 
the Strong VIB are the two most widely used interest inventories. The two
inventories present an interesting contrast in procedures of test development.
Instead of using the “like,” “indifferent,” “dislike” response categories
like those on the VIB, the Kuder inventory presents items in triads.
The subject picks from three activities the one that he likes most and the 
one that he likes least. Two illustrative item triads are as follows: 


Visit an art gallery 
Browse in a library 
Visit a museum 


Collect autographs 
Collect coins 
Collect butterflies 


Instead of scoring the form in terms of numerous separate occupations,
scores are given in 10 general areas: outdoor, mechanical, computational,
scientific, persuasive, artistic, literary, musical, social service, and clerical.
The 10 categories were derived by item-analysis procedures, much the 
same as factor analysis would obtain. Unlike the VIB, the 10 interest areas 
on the Kuder inventory were not related in any direct way with the 
responses of people in specific occupations. The interpretation of an inter- 
est profile from the Kuder inventory is largely dependent on the judg- 
ment of the counselor as to the interests that are involved in different 
occupations. 

Although the Strong and the Kuder inventories started out on different 
tracks, recent developments have brought them closer together. A number 
of broad interest areas have been developed for the Strong VIB by factor- 
analytic studies. Scores on these can be obtained in addition to those for 
specific occupations. Research done since the Kuder inventory was pub- 
lished shows that the original logical method of analyzing interests leads 
to good predictions for certain occupations and poor predictions for others. 
Efforts are now being made to obtain equations for predicting how 
closely an individual’s interests on the Kuder inventory match those of 
people in different occupations. However, this work is not far enough 
along to provide the same empirical evidence for interpreting responses 
as is furnished by the VIB. 

Interests and Accomplishment. Because a person is interested in certain 
activities, such as those relating to engineering, it does not necessarily 
mean that he has the capacity for accomplishment in that field. The rela- 
tionship between interests and ability is particularly tenuous in children 
and young adolescents. The child who professes an interest in athletic
activities, for example, may have little athletic ability, and similarly for 
artistic and scientific pursuits. However, there is an increasing congru- 
ence between interests and ability as the individual matures. It is very 
difficult for a person to maintain an interest in activities in which he con- 
stantly performs poorly. As the child matures, his interests gradually shift 
to the things that he can do at least relatively well. 

Stability of Interests. Without some stability over time, scores on inter- 
est inventories would be of little use in advising people on vocational 
choices. Interests are notoriously unstable in children and adolescents.
They begin to stabilize in the late teens and remain remarkably stable
throughout adulthood. Strong (16) found retest correlations in the .70s
and .80s over intervals as long as twenty-two years. This is both a credit
to the VIB and strong evidence that interests are relatively enduring
characteristics of human adults.

Interest Inventories in Vocational Guidance. Interest inventories are 
second only to intelligence tests as aids to vocational guidance. Interests 
are, at least theoretically, very important to consider in choosing occupations.
If an individual really likes a particular type of work, he can often
succeed in spite of only a moderate amount of aptitude. No matter how 
much initial aptitude a person has, he can fail in a line of work through 
inattention and lack of effort. 

In vocational guidance, interest inventories are used for two related 
purposes: to predict satisfaction in the work and to predict successful 
performance. The criterion keying on the Strong inventory provides some
supporting evidence that interest tests can predict future satisfaction on 
the job. Another type of evidence is that follow-up studies of individuals 
who completed the VIB in college show a strong tendency for people to 
enter occupations similar to their expressed interests. Both these pieces of 
evidence also tend to support the hypothesis that interests are predictive, 
at least to some extent, of job performance. Strong (15, 17) has gathered 
more direct evidence to show that interest scores are predictive of per- 
formance in some occupations. For example, there is a relationship 
between the amount of interest shown on the key for insurance agents 
and the amount of insurance which agents sell. 

Even though interests are, at least theoretically, very important to consider
in choosing occupations, it does not necessarily follow that the
available instruments are maximally effective measures of interests. As is 
true in most areas of testing, a great deal more research with interest 
inventories is needed. 

It is unfortunate that interest tests cannot be used as successfully in 
the selection of people for particular jobs as they can in vocational guid- 
ance. It has been shown repeatedly that people can fake interest tests to 
a marked extent. If people are told to mark the Kuder or the Strong 
inventory as a successful engineer or physician would, they will obtain 
profiles similar to the profession in question.

People usually give honest responses in a vocational guidance situation. 
They are there for information and advice, and there is little to gain by 
faking an interest inventory one way or the other. If, as is usually the 
case, the vocational guidance facility is not connected with personnel- 
selection programs, there is no way in which test scores can lower the 
individual's chances of getting a particular job. When an individual 
applies for a particular job, he is seldom as desirous of learning about 
himself as he is of obtaining the position. If the individual is applying for 
a job as an electrician, he knows that it behooves him to answer “yes” to
an interest item like, “Do you like to repair electrical motors?” The small 
amount of success that interest inventories meet in personnel-selection 
programs should not mar the important place they have in vocational 
guidance.


CHAPTER 14


Inventories for Self-description 


This and the succeeding two chapters are concerned with measures of
personality. There are almost as many definitions of personality as there
are methods to measure personality. The term personality is popularly
used to indicate social charm. A person is said to have a lot of personality,
meaning that he possesses all of the social graces.

Personality is often defined negatively as the nonability or noncognitive 
functions. Any measurement device that is not clearly either an aptitude 
or achievement test is often classified as a personality test. It is unsatis- 
factory to define things in this way. For example, it would be of little 
help to a Martian to be told only that a dog is something that is neither a 
cat nor a cow. 

It proves very difficult to give concise definitions of personality, mainly 
because there are many kinds of human attributes which could be so 
labeled. A typical definition is “personality is the total functioning individual
interacting with his environment.” This definition certainly lacks
little in terms of inclusiveness, but it provides no starting point for the 
measurement of personality characteristics. The best starting point for a 
discussion of personality measurement is to distinguish some of the differ- 
ent kinds of personal characteristics which can be studied. In the preced- 
ing chapter the human attributes of opinions, attitudes, and interests were 
defined and treated separately from the kinds of attributes which will be 
treated here and in the succeeding two chapters. 

The following three kinds of attributes are most often discussed under 
the term personality:

1. Social traits. The characteristic behavior of an individual in respect 
to other people. Typical traits are honesty, gregariousness, shyness, domi- 
nance, talkativeness, and humor. Social traits are often referred to as the 
surface layer of personality, the way that the individual appears in society. 
Inventories which seek to measure social traits are said to concern the 
normal personality.

2. Maladjustment. This covers a broad range of disorders including
hypertension, lack of emotional control, intense fears, disruption of
thought processes, and depression, among others. Although the definition
of maladjustment depends in part on the values in a particular culture, a 
person is usually said to be maladjusted if his own behavior is severely 
disturbing to himself and/or the people around him. When a person 
becomes highly distraught or his behavior is odd or dangerous, he is 
classified as mentally ill. 

3. Dynamics. A difficult-to-define collection of the underlying individual
processes which produce different states of adjustment and different
kinds of social traits. Dynamics involves such things as drives, the
Oedipus complex, amount of ego strength, body image, dream symbolism 
—none of which can be observed directly but must be complexly inferred 
from the individual's behavior. 

The instruments which will be discussed in this chapter are primarily 
concerned with social traits and maladjustment. Although these inventories
relate to some extent to dynamics, the projective tests to be discussed in the
next chapter are the primary instruments in the realm of dynamics.

A careful distinction should be made between inventories for self- 
description and other kinds of paper-and-pencil personality tests. The 
instruments which will be discussed in this chapter are all standard pro- 
cedures for having the individual say what he is like. Typical items are 
“Do you lead the discussion when talking to others?” “Do you enjoy 
competition?” “Are you attractive to the other sex?” There are other kinds 
of personality inventories which are not concerned with self-description. 
An example is a measure of suggestibility, or “acquiescence,” in which 
the subject is presented a series of highly ambiguous statements and is 
asked to indicate whether he agrees or disagrees with each (supposedly 
the person who tends to agree in this situation is more “acquiescent”). 
Some inventories which are not directly concerned with self-description 
will be discussed in the next two chapters. 

Dozens of personality inventories are currently available; hundreds 
have seen general use during the last forty years, and thousands have 
been employed for particular research and selection problems. Conse- 
quently, the inventories which will be discussed in this chapter are 
examples chosen only to represent different kinds of testing problems. 


MALADJUSTMENT 


Self-report is one of the basic tools in the diagnosis of illness. The 
physician asks, “Where does it hurt?” “How is your appetite?” The patient 
volunteers information like, “I don’t sleep so well,” and “I am all out of 
pep.” The self-report technique carried over into psychiatry and clinical 
psychology. The questions are different but the method is the same. The 
psychiatrist asks the patient whether he has nightmares or feels uncomfortable
in a crowd; he inquires how the patient gets along with parents
and friends. Over the years it became apparent that certain questions are
more successful than others in detecting maladjustment. These have
become almost standard questions for all interviews, used along with
questions that have particular relevance to a case. It is an easy jump
from a list of standard questions to a test: all that is necessary is to write 
the questions down and have the subject indicate his agreement or dis- 
agreement with each. This is exactly how the first personality inventories 
were developed. 

The first self-description inventory that achieved prominence was 
Woodworth’s Personal Data Sheet (19). The inventory was developed 
as a means of weeding out emotionally unstable persons from the United 
States Army. A standard, time-saving procedure was needed because of 
the shortage of trained interviewers. The inventory contains 116 questions
concerning “neurotic” tendencies, ten of which are as follows: 

Are you troubled with dreams about your work? 

Do you often have the feeling of suffocating? 

Have you ever had fits of dizziness? 

Did you ever have convulsions? 

Did you have a happy childhood? 

Have you ever seen a vision? 

Did you ever have a strong desire to commit suicide? 

Can you stand the sight of blood? 

Are you troubled by the idea that people are watching you on the
street?

Does it make you uneasy to sit in a small room with the door shut? 

The questions were obtained from a search of the psychiatric literature 
and from conferences with psychiatrists. A neurotic tendency score is 
obtained for each person by adding the number of “neurotic” responses. 

A small amount of research was undertaken to standardize the Personal 
Data Sheet. Items were eliminated if more than 25 per cent of normal 
individuals gave the “neurotic” response. Comparisons were made be- 
tween the responses given by unselected draftees and those given by a 
small group of declared neurotic soldiers. Woodworth recognized that 
the validity of the Personal Data Sheet depends on the diagnostic value 
of the questions which are asked and on the honest reporting of the 
subject. The inventory can easily be faked by anyone who chooses to 
do so. 
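
The scoring and standardization rules just described are simple enough to sketch in Python. This is an illustration under stated assumptions, not a reproduction of Woodworth's materials: the short item labels, the keyed "neurotic" direction assigned to each item, and the four invented "normal" respondents are all hypothetical. Only the two rules themselves come from the text: the score is the count of "neurotic" responses, and an item is dropped if more than 25 per cent of normal individuals give the "neurotic" response.

```python
def retain_items(normal_group: list, neurotic_key: dict, max_rate: float = 0.25) -> dict:
    """Keep only items whose keyed 'neurotic' answer is given by no more
    than max_rate of the normal group."""
    kept = {}
    for item, keyed_answer in neurotic_key.items():
        hits = sum(1 for person in normal_group if person.get(item) == keyed_answer)
        if hits / len(normal_group) <= max_rate:
            kept[item] = keyed_answer
    return kept


def neurotic_score(answers: dict, neurotic_key: dict) -> int:
    """Count the answers that match the keyed 'neurotic' direction."""
    return sum(1 for item, keyed in neurotic_key.items() if answers.get(item) == keyed)


# Hypothetical key: the answer assumed to count as 'neurotic' for each item.
neurotic_key = {"fits_of_dizziness": "yes",
                "happy_childhood": "no",
                "stand_sight_of_blood": "no"}

# Invented responses from four 'normal' individuals.
normals = [
    {"fits_of_dizziness": "no", "happy_childhood": "yes", "stand_sight_of_blood": "no"},
    {"fits_of_dizziness": "no", "happy_childhood": "yes", "stand_sight_of_blood": "yes"},
    {"fits_of_dizziness": "yes", "happy_childhood": "yes", "stand_sight_of_blood": "no"},
    {"fits_of_dizziness": "no", "happy_childhood": "yes", "stand_sight_of_blood": "yes"},
]

final_key = retain_items(normals, neurotic_key)
recruit = {"fits_of_dizziness": "yes", "happy_childhood": "no", "stand_sight_of_blood": "no"}
print(sorted(final_key), neurotic_score(recruit, final_key))
```

In this invented sample the item about the sight of blood is dropped, because half of the normal group give the keyed answer, so it no longer contributes to the recruit's score.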

The Personal Data Sheet was not considered to be a test in the strict 
sense of the word. Persons who gave more than 30 or 40 “neurotic”
responses were brought in for detailed psychiatric interviews. Although
little direct evidence for validity was obtained, people who worked with
the Personal Data Sheet during the First World War were generally
satisfied with the inventory as an aid to psychiatric screening. After the
First World War an interest developed in the construction of tests of all 
kinds, personality inventories included. Most of the inventories were 
modeled directly after the Personal Data Sheet, to the extent of using 
many of the same items. 

Bell Adjustment Inventory. A recognized weakness of the Personal Data
Sheet is that it provides only one over-all index of adjustment versus
maladjustment, providing no information about the kind of maladjustment.
The Bell inventory (2) seeks to measure adjustment in the four
areas of home adjustment, health adjustment, social adjustment, and emotional
adjustment, plus a score for total adjustment. The Bell inventory
was not only one of the first instruments to measure differential adjustment;
it was also one of the first to be constructed in terms of item-analysis
statistics. Bell had originally hypothesized 10 adjustment areas. Items
were either selected from previous inventories or composed to measure 
each of the 10 areas. The items were administered to groups of students. 
Items were retained in each adjustment area only if they correlated with 
the total score in that area. Only 4 of the 10 hypothesized areas hung 
together well enough to be included in the final form. Ensuring that items 
in each adjustment area are homogeneous does not mean that they are 
necessarily valid for any purpose, but it does give some assurance that 
the items tend to measure the same underlying trait. 
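
The item-analysis step described above, retaining an item only if it correlates with the total score in its area, can be sketched as follows. The sketch is illustrative only: the cutoff of .50, the function names, and the six invented respondents are assumptions, and the source does not state what correlation Bell required. The total used here includes the item being examined.

```python
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def homogeneous_items(responses: list, cutoff: float = 0.50) -> list:
    """Return the indices of items whose correlation with the area total
    is at least cutoff (hypothetical value)."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    kept = []
    for j in range(n_items):
        item_scores = [row[j] for row in responses]
        if pearson(item_scores, totals) >= cutoff:
            kept.append(j)
    return kept


# Invented data: six students by four items, 1 = keyed response given.
area_responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
retained = homogeneous_items(area_responses)
print(retained)
```

With these invented responses only the first two items are retained. As the text notes, this kind of screening speaks to homogeneity, not to validity for any particular purpose.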

The Bell Adjustment Inventory has been used widely during the last 
twenty years as an aid in the counseling of high school and college 
students. The inventory is administered routinely in many schools to 
freshmen students. Those who have low scores in one or several of the 
adjustment areas are brought in for interviews to see if counseling and 
other help are required. 

Minnesota Multiphasic Personality Inventory (MMPI). The MMPI 
(13, 14) represents the apex of research and detailed test construction in 
the area of adjustment inventories. Research on the instrument has gone
on for twenty years now, and hundreds of journal articles have been de- 
voted to its construction, refinement, and use. The MMPI is intended to 
measure the relative presence or absence of the nine forms of mental 
illness listed below. Two related items are shown for each type of mental 
illness. A plus sign means that persons who have the illness are likely 
to agree with the item; a negative sign means that they are likely to 
disagree.

Hypochondriasis (Hs). Overconcern with body functions and imagined 
illness. 

Related items: 

I do not tire quickly. (-)


The top of my head sometimes feels tender. (+) 

Depression (D). This is used in the conventional sense to imply strong
feelings of “blueness,” despondence, and worthlessness. 

Related items: 

I am easily awakened by noise. (+) 
Everything is turning out just as the prophets of the Bible said it 
would. (+) 

Hysteria (Hy). The development of physical disorders such as blind- 
ness, paralysis, and vomiting as an escape from emotional problems. 

Related items: 

I am likely not to speak to people until they speak to me. (+) 
I get mad easily and then get over it soon. (+) 

Psychopathic deviate (Pd). An individual who lacks “conscience,” who 
has little regard for the feelings of others, and who gets into trouble 
frequently. 

Related items: 

My family does not like the work I have chosen. (+) 
What others think of me does not bother me. (+) 

Paranoia (Pa). Extreme suspiciousness to the point of imagining 
elaborate plots. 

Related items: 

I am sure I am being talked about. (+) 
Someone has control over my mind. (+) 

Psychasthenia (Pt). Strong fears and compulsions. 

Related items: 

I easily become impatient with people. (+) 
I wish I could be as happy as others seem to be. (+) 

Schizophrenia (Sc). Bizarre thoughts and actions, out of communi- 
cation with the world. 

Related items: 

I have never been in love with anyone. (+) 
I loved my mother. (-) 

Hypomania (Ma). Overactivity, inability to concentrate on one thing 
for more than a moment. 

Related items: 

I don’t blame anyone for trying to grab everything he can get in this 
world. (+) 
When I get bored I like to stir up some excitement. (+) 

Masculinity-femininity (Mf). The relative balance of male versus 
female interests. 

Related items: 

I like movie love scenes. (F) 
I used to keep a diary. (F) 


There are several noteworthy features of the MMPI which set it above 
most inventories which are used to detect maladjustment. Face validity 
was not a concern in the construction of the instrument as it is too often 
with other inventories. The scales used to measure the nine kinds of 
mental illness were developed on an empirical basis by criterion keying. 
A large group of items was administered initially to several hundred 
normal persons and to groups of mental hospital patients whose symp- 
toms matched one of the nine kinds of mental illness. Item analyses were 
undertaken to find the scoring key for each illness scale which would best 
differentiate the patients of one type from normals and from other types 
of patients. 
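
A rough sketch of criterion keying may make the procedure concrete. It is a simplification under stated assumptions: only one patient group is contrasted with normals (the MMPI scales were also required to separate each patient type from the others), the minimum difference of .30 in endorsement rates is a hypothetical threshold, and the function names are illustrative.

```python
def endorsement_rate(group, item):
    """Proportion of persons in the group who agree (1) with the item."""
    return sum(person[item] for person in group) / len(group)

def build_key(patients, normals, min_diff=0.30):
    """Map item index -> keyed response (1 = agree, 0 = disagree).

    An item enters the key only if the two groups differ in endorsement
    rate by at least min_diff; the direction of the difference sets the key.
    """
    key = {}
    for item in range(len(patients[0])):
        diff = endorsement_rate(patients, item) - endorsement_rate(normals, item)
        if diff >= min_diff:
            key[item] = 1   # patients agree more often: a "plus" item
        elif diff <= -min_diff:
            key[item] = 0   # patients agree less often: a "minus" item
    return key

def scale_score(person, key):
    """Number of keyed items answered in the patient direction."""
    return sum(1 for item, keyed in key.items() if person[item] == keyed)

# Hypothetical agree/disagree responses to three items.
patients = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 1]]
normals = [[0, 1, 1], [0, 1, 0], [1, 1, 1], [0, 1, 1]]
key = build_key(patients, normals)  # item 2 does not discriminate and is left unkeyed
```

Note that the content of an item plays no part in the keying; an item is scored only because the criterion groups answer it differently.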

The criterion-keying procedure used on the MMPI is much like that 
used to develop the Strong interest scales. Criterion keying is advantage- 
ous in that it promotes the predictive validity of an instrument, and it 
places little dependence on the insight and frankness of the respondent. 

In addition to the nine mental illness scales available on the MMPI, 
four so-called “validity scores” are used. These provide some information 
about the test-taking attitude of the subject and the relative honesty with 
which his responses are made. The “validity scores” are as follows: 

The Question Score (?). This consists of the number of items marked 
in the “cannot say” category. The interpretation is that if a person 
has a high question score, the scale scores for the different kinds of 
illness appear lower than they should be. If a person has as many as 
130 “cannot say” responses, the individual's test record is assumed 
to be invalid. 

The Lie Score (L). This scale consists of 15 items concerning socially 
desirable actions which few people could truthfully endorse; e.g., 
“I like everyone I know.” The assumption is that an individual who 
endorses numerous items of this kind is falsifying the inventory. 

The Validity Score (F). This consists of 64 items which are endorsed 
infrequently by normal subjects. They concern a hodgepodge of 
symptoms which are not likely to occur in any one mental illness. 
The interpretation is that a person who endorses a number of these 
items is careless or does not understand the test instructions. 

The Correction Score (K). This consists of 30 items which were found 
to differentiate clinical patients whose scale scores appeared normal 
from persons who were actually normal. The responses to these 
items can be used to correct the illness scores for particular persons. 

It should not be assumed that “validity scores” like those on the MMPI 
are a panacea in the use of personality inventories. There is not enough 
evidence yet to support the contention that the “validity scores” actually 
measure what they are purported to measure. The question, lie, and 
“validity” scores may indicate that a respondent gives misleading re- 
sponses, but nothing can be done except to throw away the test record. 
The correction score (K) offers a more positive approach, but not enough 
research has been done to say how well it works. 

The results of the MMPI can be plotted as a profile showing scores on 
the nine illness scales and those on the four “validity” scales (see Figure 
14-1 for illustrative profiles). It is seldom found that a person scores 
high (the maladjusted direction) on only one of the scales. Typically, 
the maladjustment spreads across several of the scales. Some of the scales 
correlate substantially with the others, which is to be expected because 
mental illness seldom occurs as one specific pattern of traits. It is usually 
the case that the patient has a mixture of different kinds of mental illness. 

Figure 14-1. MMPI profiles for a normal adult and for a “typical psychotic.” 
(Adapted from Gough, 11, pp. 554 and 563; reproduced by permission of The 
Ronald Press Company.) 

In order to interpret the MMPI profile, complex pattern scoring meth- 
ods have been devised (14). Even with these it is necessary to have an 
experienced clinical psychologist to interpret the results. As is true of 
many clinical methods, a complex lore has developed about the meaning 
of different kinds of MMPI profiles, much of which has only slight 
grounding in empirical fact. 

Much stands or falls on the eventual validity of the MMPI. Never be- 
fore (and perhaps never again) has so much careful research gone into 
the empirical derivation of a self-description inventory to measure differ- 
ent kinds of maladjustment. The success of this venture will have a strong 
effect on the future course of personality measurement. The present 
evidence indicates that the MMPI is better than any comparable mal- 
adjustment inventory but not nearly as effective as was hoped for 
initially. 


SOCIAL TRAITS 


The difference between measures of maladjustment and social traits 
is more a matter of degree than a hard and fast distinction. Some of 
the inventories which are primarily concerned with social traits also 
contain one or more scales concerning different kinds of maladjustment. 
The extremes of most social traits border on maladjustment. For example, 
shyness, until it becomes extreme, can be a manifestation of normal 
behavior. 

Allport Ascendance-Submission Reaction Study. One of the earliest 
self-description inventories and one that is still very much with us is the 
ascendance-submission (A-S) inventory developed by the Allports (1). 
Its purpose is to measure the tendency of a person either to lead and 
dominate his friends and associates or to be led and dominated by them. 
Each item describes a situation in which the respondent can show either 
dominance or submission to some degree. An illustrative item is as 
follows: 

Someone tries to push ahead of you in line. You have been waiting 
for some time, and can’t wait much longer. Suppose the intruder is 
the same sex as yourself, do you usually: 

Remonstrate with the intruder 
Call the attention of the man at the ticket window 
“Look daggers” at the intruder or make clearly audible comments 
Decide not to wait and go away 
Do nothing 

Scoring weights for the A-S inventory items were determined in such 
a way as to differentiate persons who were rated high in dominance, by 
themselves and by associates, from those who were rated low in domi- 
nance. The A-S inventory has been used extensively during the last 
twenty years, and the research findings show that, on occasion, it can 
serve as a useful and valid measure. 

Bernreuter Personality Inventory. Bernreuter’s inventory (3) first ap- 
peared in 1932 and has been in wide use since that time. His purpose 
was to combine in one inventory the information which could be 
obtained from a number of other inventories. He used items which 
appeared in current tests of that day and composed scoring keys for the 
four scales of “neuroticism,” “self-sufficiency,” “introversion,” and “domi- 
nance.” The only evidence of validity offered by Bernreuter is that the 
scales on the inventory correlate highly with the parent inventories from 
which the items were drawn. 

One of the difficulties exemplified by the Bernreuter Personality In- 
ventory is that of obtaining independent measures of different traits. 
If, as on the Bernreuter inventory, there are four scales, it is expected 
that the four scores concern at least partially different traits. This is far 
from the case on the Bernreuter and on many other personality inven- 
tories. Introversion and neuroticism scores on the Bernreuter correlate 
.95, meaning that they are almost totally identical. 

Beginning in the 1930s and continuing to this time, factor analysis and 
similar procedures have been used extensively to construct differential 
personality inventories. Flanagan’s (9) factor analysis of the Bernreuter 
scales set the stage for much such work to follow. He found that two 
factors account for nearly all of the variance in the four scales which 
Bernreuter employs. He called these two factors “self-confidence” and 
“sociability.” Bernreuter added scoring keys for Flanagan’s two factors, 
which are used in addition to the other four scales. Six names do not 
constitute six traits. When the inventory is used, only the two scales 
developed by Flanagan should be scored. 

Multifactor Batteries. An extensive effort has been under way to chart 
the major factors arising from self-description inventories. The hope is 
to find a limited number of traits that account for the scores obtained 
from diverse inventories. A series of studies by Guilford and his asso- 
ciates laid the groundwork in this field. He first collected 35 statements 
which were purported by different authorities to be primary aspects of 
introversion-extraversion. These were made into an inventory and ad- 
ministered to groups of subjects. A factor analysis of the items produced 
five factors instead of the single continuum along which introversion- 
extraversion is commonly judged. Successive factor analyses were per- 
formed on new sets of items until 13 factors were found in all. The 10 
most prominent factors appear in the Guilford-Zimmerman Temperament 
Survey (12). The factor names and descriptions are as follows: 

G, General Activity. High energy, quickness of action, liking for 
speed, and efficiency 
R, Restraint. Deliberate, serious-minded, persistent 
A, Ascendance. Leadership, initiative, persuasiveness 
S, Sociability. Having many friends, and liking social activities 
E, Emotional Stability. Composure, cheerfulness, evenness of 
moods 
O, Objectivity. Freedom from suspiciousness, from hypersensitivity, 
and from getting into trouble 
F, Friendliness. Respect for others, acceptance of domination, 
toleration of hostility 
T, Thoughtfulness. Reflective, meditative, observing of self and 
others 
P, Personal relations. Tolerance of people, faith in social institu- 
tions, freedom from faultfinding and from self-pity 
M, Masculinity. Interest in masculine activities, hardboiled, not 
easily disgusted, versus (for femininity) romantic and emo- 
tionally expressive 

Two other widely used multifactor inventories are The Sixteen Person- 
ality Factor Questionnaire (5) developed by Cattell and the Thurstone 
Temperament Schedule (18). 

Special Areas. In terms of numbers, most of the personality inventories 
are constructed for particular research or selection problems rather than 
for commercial distribution. These range from instruments to detect brain 
damage to inventories for the selection of sales clerks. There has been a 
boom in the construction of inventories for industrial selection. In some 
organizations, everyone from the janitor to the company president must 
make his marks on a self-description inventory. Much of the work which 
is done on industrial inventories is not reported on in professional jour- 
nals or in other readily available sources. The reason for this is that such 
specialized inventories are not of interest to many people, and also there 
is a desire in many cases to prevent competing commercial concerns from 
adopting the same or similar inventories. 

A typical inventory developed for a special research problem is that 
of Burgess and Cottrell (4) used to measure marital happiness. Five of 
the inventory items are as follows (multiple choice alternatives are sup- 
plied for each item): 

Do you kiss your husband (wife) every day? 
Do you ever wish you had not married? 
What things annoy and dissatisfy you most about your marriage? 
If you had your life to live over, do you think you would marry 
the same person? 
Do husband and wife engage in outside interests together? 

The inventory was administered to married couples to measure their level 
of marital adjustment. The responses were used as a criterion to deter- 
mine the predictive validity of background data as a forecaster of mar- 
riage happiness. 


EVALUATION OF SELF-DESCRIPTION INVENTORIES 


The inventories discussed so far in this chapter have been passed over 
lightly. Little attention has been given to the reliability, norms, validity, 
and uses for particular instruments. The reason for this is that a detailed 
discussion of separate inventories would not be worth the space it would 
take. It is not unduly critical to say that self-description inventories have 
mainly produced disappointing results (see 7, 8, 10, 17). The remainder 
of the chapter will consider some of the reasons why self-description 
inventories meet with difficulty. 

Reliability. Self-description inventories are usually less reliable than 
tests of aptitude, achievement, interests, and attitude. There are numer- 
ous exceptions to this rule, but self-description inventories tend to have 
lower reliability than most other psychological measures. Reliability co- 
efficients for the better-established self-description inventories usually 
range between .75 and .85. Well-established aptitude and achievement 
tests usually have reliability coefficients between .85 and .95. The reli- 
ability of self-description inventories is usually not so low as to render 
them unusable, but the amount of measurement error which is often 
present is enough to materially impair the validity. 

As is true in all tests, the reliability of self-description inventories can 
usually be raised by increasing the number of items. Some inventories 
employ large numbers of items in order to make the results reliable. 
Individual items on self-description inventories are usually less reliable 
than aptitude and achievement items. It is not uncommon to find, for 
example, that a 20-item vocabulary or mathematics test has a reliability 
over .85. It is usually necessary to have twice that many items in a self- 
description inventory to reach the .85 level of reliability, if that level can 
be reached at all. 
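
The effect of test length on reliability can be worked out with the Spearman-Brown formula, which assumes that the added items are comparable to the original ones. The sketch below is illustrative only: the function names and the starting reliability of .74 for a hypothetical 20-item inventory are assumptions chosen to match the figures in the text.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor.

    length_factor = 2 means twice as many comparable items.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def length_factor_needed(current, target):
    """Factor by which a test must be lengthened to reach the target reliability."""
    return target * (1 - current) / (current * (1 - target))

# A hypothetical 20-item self-description scale with reliability .74:
doubled = spearman_brown(0.74, 2)          # about .85 with 40 comparable items
factor = length_factor_needed(0.74, 0.85)  # about 2, i.e., roughly 40 items
```

Under these assumptions a 20-item scale with a reliability near .74 must be doubled to reach .85, which agrees with the rule of thumb given above. The formula holds only if the added items are as good as the originals, which is not always easy to arrange.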

Validity. One of the major observations to make about self-description 
inventories is that in many cases almost no effort is made to investigate 
validity. Many of the inventories are founded on “face validity.” The 
inventory items are often so directly related to the trait being measured 
(e.g., “Do you often feel depressed?”) that it is an easy step to assume 
that inventories measure what they seem to measure. Face validity is not 
only insufficient evidence of validity, it can actually be a harmful quality 
in an inventory. The more transparent the items and the more they look 
like the characteristic to be measured, the more they are open to dis- 
tortion by the respondent. 

Another approach to the determination of validity has been to corre- 
late one self-description inventory with others. This is not necessarily an 
indication of validity. If self-description inventories correlate with one 
another, this may mean only that people are consistent in applying mis- 
conceptions about themselves or that they are consistent in purposely 
distorting their own characteristics. 

Most self-description inventories require an investigation of predictive 
validity (the term “prediction” being taken broadly to mean a func- 
tional relation with other forms of behavior either in the past, at present, 
or in the future). The great difficulty in validating many self-description 
inventories is that there is almost nothing available to predict. For 
example, what should the “friendliness” factor on the Guilford-Zimmer- 
man inventory predict? 

The adjustment-maladjustment inventories are in a better position to 
establish validity than are the inventories for social traits. A procedure 
which is often used to validate maladjustment inventories is to compare 
the scores of contrasting groups of persons—mental hospital patients with 
people outside, psychiatric rejects from the Army with soldiers in general, 
persons who suffer from psychosomatic ailments with people in general. 
Even if an inventory successfully distinguishes the members of contrast- 
ing groups, it might not be able to make precise distinctions in the 
borderline area between adjustment and maladjustment. 

It is sometimes possible to validate social trait inventories with con- 
trasting groups. For example, a “leadership” inventory can be used to 
distinguish persons who achieve positions of leadership from those who 
do not. In spite of the incompleteness of the “contrasted groups” design, 
it is one of the best methods of validation currently available for many 
personality inventories. 

Personality inventories can be validated most easily when they are 
used to select persons for particular jobs, positions, and courses of train- 
ing. If a reliable success criterion is available, it can be used to test the 
predictive validity of self-description inventories. For example, this can 
be done by correlating scores made on a self-description inventory by 
prospective salesmen with the amount of sales made later. 
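
The salesman example amounts to computing a correlation coefficient between inventory scores and a later criterion. The following sketch shows the arithmetic; the scores and sales figures are invented for illustration, and the function name is an assumption.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: inventory scores at hiring, sales during the following year.
inventory_scores = [12, 15, 9, 20, 17, 11]
later_sales = [30, 42, 28, 45, 33, 35]

validity = pearson_r(inventory_scores, later_sales)  # the validity coefficient
```

A coefficient obtained from so few cases would be far too unstable to trust; in practice a validity coefficient should rest on a large group and should be checked on a new sample.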

In those cases where the validity of self-description inventories can 
be determined either by the method of contrasted groups or by correla- 
tion with a continuous criterion, the inventories usually show disappoint- 
ing results. There are isolated cases of self-description inventories which 
prove themselves valid to a reasonable extent, but these are few and far 
between. In addition to the generally low level of validity found for self- 
description inventories, the validity coefficients tend to change markedly 
with variations in testing circumstances. A common practice has been to 
administer self-description inventories to people who are already working 
on a job, and to validate the inventory by correlating it with measures 
of job success. A substantial correlation found in this way is no guarantee 
that the same results will be found when the inventory is used to select 
people for jobs. When people are applying for a job, they are eager to 
appear suited for the position, and the personality inventory is usually 
answered in such a way as to make the applicant appear in the best 
light. People who are already on the job are not under the same pressure 
to fake an inventory. 

The responses that people give to personality inventories and the 
validity of inventories can change with subtle changes in the testing 
environment. The author* found that in military settings the responses 
to self-description inventories are sometimes different when they are 
administered by civilians rather than by officers and different when the 
respondents are told that the results will become part of their military 
records than when they are told that the results are to be used strictly 
for research purposes. Even if an inventory proves valid in one testing 
situation, it must be revalidated for any other situation in which it is used 
even if the two situations appear quite similar. 

Self-description inventories have so far been of no practical aid in the 
prediction of school success. Although personality must certainly be an 
important element in school work, the author knows of no situation in 
which a self-description inventory has added materially to the predictive 
efficiency of conventional aptitude batteries. Self-description inventories 
work out better as aids to student counseling and guidance. The student 
who rates himself as unhappy, lacking in self-confidence, and without 
friends is likely to be in need of help. Persons who rate themselves as 
maladjusted are good prospects for more detailed study. There is no 
guarantee that the student who describes himself as happy, confident, 
and abundantly possessed with friends is really well adjusted. He may 
actually be well adjusted or he may be as sick or sicker than the person 
who relates his troubles. 

The validity of self-description inventories in job selection varies con- 
siderably with the job (see 10). The inventories perform moderately well 
in the selection of salesmen and sales clerks and have some validity for 
clerical workers and skilled trades; they work poorly in the selection of 
supervisors, foremen, and service workers (such as policemen). 

Language difficulties. The validity of self-description inventories de- 
pends to a considerable extent on the clarity with which items are 
phrased. Consider, for example, a typical item such as “Do you usually 
lead the discussion in group situations?” Respondents must interpret 
what is meant by “usually”: 60 per cent of the time, 75 per cent of the 
time, or 90 per cent of the time. Does the word “lead” mean to talk 
the most, make the most important points, or have the final say? Does 
the phrase “group situation” pertain only to formal groups such as club 
meetings, or does it include casual discussions among friends? This may 
be overdoing the difficulties of communicating social traits, but it illus- 
trates the need for language clarity in the phrasing of self-description 
items. 

In addition to the difficulty of wording items clearly, test constructors 
must be careful in describing the traits measured by inventories. This 
is true both of the inventories derived by factor analysis and those which 
are constructed on a “rational” basis. Some of the trait names appearing 
on current inventories are esoteric and confusing, such as “rhathymia” and 
“adventurous cyclothymia.” A school psychologist who uses self-descrip- 
tion inventories must know the meaning of the traits being measured 
before the results can be put to any valid use. The problem is not con- 
fined to self-description inventories. Aptitude factors like verbal com- 
prehension and perceptual speed might be misunderstood by the test 
user, but there is more of a problem in communicating the meaning of 
personality factors. 

* Unpublished research. 

Nunnally and Husek (15) studied the clarity of factor interpretations 
in three widely used multifactor inventories. They found that the factors 
in one battery were considerably less understandable than those in the 
other two. Because of the difficulties in communicating the meanings of 
personality traits, an attempt should be made to determine the clarity of 
factor interpretations before inventories are distributed commercially. 

A serious consequence of the difficulty of naming and explaining per- 
sonality traits such as those found in the multifactor inventories is that 
different inventories which purport to measure the same trait may have 
little in common. Correlations are sometimes very low between different 
inventories used to measure a trait such as introversion. Consequently, 
whether or not a person is said to be introverted depends on the inventory 
which is used. 

Trait Clusters. Little is gained from knowing responses to particular 
self-description items unless the items evolve into more general clusters 
of traits. The hope is that there are a relatively small number of clusters 
or factors which underlie the many specific social traits. Factor analysis 
is the tool which is used most often to search for clusters of related traits. 
Unfortunately, the factor-analytic work with self-description items has 
not borne the useful results which have been found in studies of apti- 
tudes, interests, and attitudes. 

Correlations between self-description items tend to be low. Relatively 
unclear factor solutions are often obtained. Different investigators find 
different numbers of factors. The factors are often difficult to name, and 
different investigators use different names. Although there is still con- 
troversy about how many and what kinds of factors exist in human 
abilities, there is enough confirming evidence and agreement among 
investigators to talk as we did in Chapter 9 about the “structure” of 
human abilities. There is as yet no comparable factor structure which 
can be said to underlie self-description inventories. 

The difficulty in finding self-description factors is not statistical but is 
due to the nature of social traits. Social traits are often combined in a 
highly individual manner. In order to find substantial factors, it is neces- 
sary for a group of traits to go together in the same way for all persons. 
If, for example, there is a general trait of dominance, the person who is 
dominant with women should be dominant with men, dominant at work 
and dominant at home, dominant when placed in one situation as well 
as in any other. However, people are often dominant and submissive in 
highly individual ways. The person who is dominant with women may be 
submissive with men, dominant at home, but submissive at work, domi- 
nant among strangers but submissive with people who know him well. 
The particularity of social trait patterns in individuals is not so marked 
as to spoil all efforts at determining trait clusters, but particularization is 
present to a degree that markedly hinders the successful use of self- 
description inventories. 

The Acceptability Influence. More important than the other roadblocks 
to the construction of self-description inventories is the fact that the re- 
spondent can and usually does control the responses for his own ends. 
There is a strong drive in all of us to appear socially acceptable and 
acceptable to ourselves. Acceptability in our society means being in- 
telligent, courageous, courteous, kind, dominant, and so on. It is extremely 
difficult for an individual to admit to himself that he is ignorant, cowardly, 
rude, mean, and submissive. It is even more difficult for an individual to 
admit these failings publicly. People in general tend to describe them- 
selves in rosy terms, to an extent that lowers the diagnostic value of self- 
description inventories. 

A study by Edwards (6) demonstrates the extent to which the “ac- 
ceptability influence” dominates the responses made to self-description 
inventories. In the first part of his study, 152 subjects rated the social 
acceptability of each of 140 personality trait items on a nine-point scale. 
Scale values for the items were determined, showing the relative nega- 
tive or positive social value attributed to the traits. Next, Edwards made 
the 140 items into a personality inventory. A group of students was 
asked to indicate “yes” for each item that characterized them and “no” 
for items that did not characterize them. The proportion of persons 
answering “yes” to each item was determined. Edwards found a correla- 
tion of .87 between the judged social acceptability of items and the 
proportion of people who endorsed the items. This is strong evidence that 
people generally try to describe themselves in a socially acceptable man- 
ner on personality inventories. 
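
The two steps of Edwards’ analysis, scaling the items for acceptability and then relating the scale values to endorsement rates, can be sketched as follows. The data are invented, the scale value is taken simply as the mean rating (Edwards’ own scaling method is not described here), and the names are illustrative.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Step 1: judges rate each item's social acceptability on a nine-point scale.
# One inner list per item (hypothetical ratings from three judges).
judge_ratings = [[8, 9, 7], [6, 5, 7], [3, 4, 2], [2, 1, 3]]
scale_values = [mean(ratings) for ratings in judge_ratings]

# Step 2: a separate group answers yes (1) or no (0) to the same items.
# One inner list per respondent (hypothetical answers).
answers = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
proportion_yes = [mean(person[j] for person in answers) for j in range(4)]

# Step 3: correlate judged acceptability with the proportion endorsing.
acceptability_r = pearson_r(scale_values, proportion_yes)
```

Note that the correlation is computed across items, not across persons: each item contributes one scale value and one endorsement rate.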

What is acceptable behavior for particular persons varies 
in terms of age, sex, vocation, and social position. The soldier needs to 
be tough and stern, whereas the minister is more in need of 
kindness and patience. Respondents are usually aware enough of the 
characteristics needed in a particular situation to fake in the required 
direction. 

The strength of the acceptability influence in self-description inven- 
tories depends on the degree to which the item alternatives differ in social 
acceptability and the punishments which result from not appearing 
socially acceptable. There is some variance in acceptability among the 
alternatives in attitude, opinion, and interest inventories, but not nearly 
so much as is involved in most self-description inventories. The punish- 
ment received for not appearing socially acceptable is either embarrass- 
ment or failure to obtain a sought-after job or position. 

The acceptability influence is less strong when respondents mark in- 
ventories in private, when they do not put their names on the forms, when 
assurance is given that all results will be kept anonymous, and when the 
inventory is unconnected with selection procedures of any kind. This is 
the situation that prevails in most research studies, and it is in research 
studies that self-description inventories have their most valid use. The 
acceptability influence becomes particularly strong when the friends and 
acquaintances of a respondent will see the responses or when the re- 
spondent is applying for a particular job. 

Efforts have been made to cancel out the acceptability influence by 
using alternatives of equal social acceptability. For example, the respond- 
ent is asked to rate whether he is more strong or intelligent, more im- 
patient or distrustful, and more calm or friendly. Employing items of this 
kind is referred to as the forced-choice technique. 

A forced-choice inventory which seems to meet with some success as 
a military screening aid is the Shipley Personal Inventory (16). The item 
alternatives were chosen to be of nearly equal social acceptability. A scoring key 
was derived by criterion keying to differentiate the responses of psy- 
chiatric patients from normal persons. A typical item is as follows: 


I have felt bad more from head colds. 
I have felt bad more from dizziness. 


The forced-choice technique is a promising lead in the construction of 
self-description inventories but by no means a cure for all of their faults. 
Even with the best efforts to equate the social acceptability of alterna- 
tives, the discerning person can often recognize the most desirable 
responses. 

Summary Evaluation. Extensive amounts of time, energy, funds, and
hope have been invested in the construction of self-description inventories
during the last thirty years. It is currently a major activity of many
persons in psychology and kindred fields. The inherent difficulties of
constructing valid inventories and the poor validities which the inventories
usually produce make many psychologists feel that further work
along these lines would be futile. The great need to measure personality
characteristics and the paucity of adequate measures should make us
cautious about disparaging any well-intentioned efforts. However, the
author is among those who believe that the successful measurement of
personality characteristics will rest largely on testing procedures other
than self-description inventories.
CHAPTER 15


Projective Techniques 


The projective techniques offer an approach to the measurement of 
personality which is interestingly different from that of the self-description 
inventories. Whereas the self-description inventories require the subject 
to describe himself, the projective techniques require the subject to de- 
scribe or interpret objects other than himself. The projective techniques 
are based on the hypothesis that an individual’s responses to an “un- 
structured” stimulus are influenced by his needs, motives, fears, expecta- 
tions, and concerns. 

If there is an agreed-on public meaning for a stimulus, it is referred 
to as a structured stimulus. If there is no agreed-on public meaning for 
a stimulus, and in consequence there is considerable latitude for in- 
dividual interpretation, it is referred to as an unstructured stimulus. A 
structured stimulus is compared with an unstructured stimulus in Figure 
15-1. First, what do you see in picture A? Nearly everyone will say that 
it is a house. A few people might call it a school or even a jail, but people 
will generally agree that it is a dwelling of some kind. The shape of a 
house is a highly structured stimulus. Now what do you see in picture
B? There is no accepted common meaning for that stimulus pattern. It
might be interpreted as a thunderstorm, a dog, or an artist’s palette. 

There is a considerable body of evidence to show that interpretations 
of relatively unstructured stimuli are related to moods, needs, and ex- 
pectations. Studies of the effect of food deprivation show that hungry 
subjects more frequently interpret ambiguous drawings as representing 
food than do non-hungry subjects (9). In one study a completely blank 
screen was used and subjects were led to believe that faint images were 
being presented (10). It was found that the number of food responses 
increased as the interval of food deprivation lengthened. Other studies 
show that perception is influenced by values, social taboos, and personal 
conflicts (see 5). An experience common to most of us is the misreading 
of printed material in a way that indicates our concerns of the moment. 
For example, the student who worries over an examination coming the 
next day is likely to see in a hasty glance at the evening paper that “The 
police will give the examination tomorrow,” whereas a careful reading
will show that “The police will give the explanation tomorrow.” 

If the individual interprets picture B in Figure 15-1 as a thunderstorm 
rather than as a dog, this may indicate something about him personally. 
However, it is one thing to say that a response is significant and quite 
another thing to say just what it indicates about the person. 


Figure 15-1. Comparison of a relatively structured stimulus (A) with a relatively
unstructured stimulus (B).


The instruments which will be discussed in this chapter are primarily 
based on the interpretation of personality characteristics from responses 
to relatively unstructured stimuli. The ultimate in unstructured stimuli 
is found in one of the pictures in the Thematic Apperception Test (TAT). 
The subject is asked to make up a story about a completely blank card. 
Other stimuli are structured to some extent to obtain information about 
particular needs and concerns. For example, in an instrument used to 
study attitudes, a picture which shows a white person and a Negro talk- 
ing can be used. The stimulus is structured to that extent and in that 
manner in order to learn about attitudes toward Negroes. 

The term “projective tests,” which is used to refer to the kinds of 
instruments to be discussed in this chapter, is somewhat of a misnomer. 
The more strict meaning of the term projection is the tendency of an 
individual to see his own unwanted traits, ideas, and concerns in other 
people. For example, the individual who is himself a rather hostile per- 
son is often prone to see hostility in other people on the basis of little or 
no evidence. Here we will be concerned not only with projection in the 
stricter sense but with the many ways in which the interpretation of
unstructured stimuli can be used to detect personality characteristics. 
RORSCHACH TECHNIQUE


By far the most widely used projective technique is the Rorschach. 
It was developed during and after the First World War by Hermann
Rorschach, a Swiss psychiatrist. He experimented with different ink blots
to find a set which would provide the most insight into the nature of
mental disorder. The 10 ink blots which he settled on are still in use.
An ink blot like those used in the test is shown in Figure 15-2. Five of
the ink blot cards are made in shades of black and gray only. Two of
the remaining cards contain bright patches of red in addition to shades
of gray. The three remaining cards employ various colors.

Figure 15-2. An ink blot of the type used in the Rorschach Test.

Administration. The Rorschach, like the other projective devices, 
should be administered only by a highly trained examiner. The results 
will depend very much on the examiner’s skill. Although there are several 
schools of thought as to how the Rorschach should be administered and 
interpreted, a similar approach is used by most examiners. The most 
widely used approaches are those advocated by Beck (2) and by Klopfer 
and Kelley (8).

The usual procedure followed in administering the Rorschach is to 
talk with the respondent a minute or so to gain rapport, seat him with 
his back to the examiner, and then introduce the task with approximately 
the following instructions: “People see all sorts of things in these ink
blot pictures; now tell me what you see, what it might be for you, what 
it makes you think of” (8, p. 82). The respondent is allowed to give as 
many interpretations as he likes. If he gives only one response and ap- 
parently tries to give no more, the examiner suggests that he look for 
other things, saying something like, “People usually see more than one 
thing.” A typical series of responses to one card would be for the subject 
to report, “It all looks like a bat” and “This part looks like a vase.” After 
looking at the card for a few more seconds, he might give as his final 
response, “This little bit looks like a nose.” 

After the respondent runs out of interpretations for the first card, he 
is handed the next one, and so on for all 10 cards. The examiner is kept 
busy ordering the materials, measuring the time taken to give each re- 
sponse, and writing down verbatim what the respondent says. After 
the series of ink blots is administered, the examiner conducts what is 
called the “inquiry.” This consists of asking the subject questions about 
each of his responses to learn what part of the blot is involved in the 
interpretation and why the particular part gives rise to the response. 

Scoring. There are a number of different kinds of scores given to each 
response. The major scoring categories relate to “content,” “determinants,” 
and “location.” The interpretation of a response depends on the scores 
received from all three categories. Consequently, only very general im- 
plications can be given to scores in one category alone. 

Some of the major categories of content are listed and illustrated as 
follows: 

Human. A response that would be scored human in content is “a 
man doing a dance.” 

Human Detail, e.g., “a nose.” 

Animal, e.g., “a bat.” 

Anatomy, e.g., “bones.” 

Clothing, e.g., “a nun’s habit.” 

Nature, e.g., “northern lights.” 

Sex, e.g., “a woman’s breast.” 

Landscape, e.g., “a lake.” 

Food, e.g., “a steak with mushrooms.”

If a person gives almost all animal responses, this is taken as an in- 
dication of low “functioning intelligence.” A predominance of anatomy 
responses is usually interpreted as an indication of depressive tendency 
or of overconcern with health. If an individual gives many more responses 
in one content category than in others, e.g., many food responses, this 
is taken as evidence of an especial personality need.
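
The tallying of content categories described above can be illustrated with a short computation. The following Python sketch is only an illustration and is not part of any published scoring procedure; the function name content_proportions and the sample record are hypothetical. It shows how the share of responses falling in each content category might be computed from a list of scored responses.

```python
from collections import Counter


def content_proportions(content_codes):
    """Return the share of responses falling in each content category."""
    counts = Counter(content_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Hypothetical record: one content code per response (invented for illustration).
record = ["Animal", "Animal", "Human", "Animal", "Anatomy",
          "Animal", "Food", "Animal", "Human Detail", "Animal"]

# Print the categories from most to least frequent.
for category, share in sorted(content_proportions(record).items(),
                              key=lambda item: item[1], reverse=True):
    print(f"{category:13s} {share:.0%}")
```

In this made-up record, animal responses account for six of the ten responses. How large a share should count as a "predominance" is a matter for the examiner's judgment and is not settled by the arithmetic.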

Location refers to the part of the ink blot which instigates a particular
response. This information is usually obtained in the inquiry. The re-

approach to life, he deals in sweeping generalizations considers

responses of the adult contain approximately 6w

The scoring of determinants is both more complex and more subjective
than the scoring of content and location. If the response is determined
by the shape or outline of either the whole or part of the ink blot, it is
scored as a form response (F). Form responses are further scored either
“plus,” if they are of the kind that normal and
superior people would give, or “minus,” if they are not adequate
interpretations. A response is scored M if movement is seen in the
interpretation. If the response is apparently determined
by the color of the part which is interpreted, it is scored C. If the
response is “sea water,” and in the inquiry the subject gives as his reason,
“because it's green,” it would be scored as a color response. A response
is scored Y if it is determined by the shades of gray within a blot. A
typical Y response is “dark clouds.” It is usually the case that a response
must be scored on more than one determinant. For example, an FC score
means that the response is determined mainly by form and that color
enters as a secondary determinant. A CF response means that color is
the primary determinant and form the secondary determinant.
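
The convention that the first letter of a combined score names the primary determinant can be made concrete with a small sketch. The Python below is a hypothetical illustration only: the function read_determinants and the lookup table are assumptions introduced here, and the sketch covers only the four letters mentioned in the text (F, M, C, Y), not the plus and minus scoring of form or any fuller scoring system.

```python
# Determinant letters as described in the text; the table itself is an
# illustrative assumption, not an official scoring key.
DETERMINANTS = {
    "F": "form",
    "M": "movement",
    "C": "color",
    "Y": "shades of gray",
}


def read_determinants(code):
    """Split a score such as 'FC' into (primary, secondary) determinants.

    The first letter is taken as the primary determinant and the second,
    if present, as the secondary determinant.
    """
    letters = list(code)
    if not 1 <= len(letters) <= 2:
        raise ValueError(f"expected one or two letters, got {code!r}")
    for letter in letters:
        if letter not in DETERMINANTS:
            raise ValueError(f"unknown determinant letter {letter!r}")
    primary = DETERMINANTS[letters[0]]
    secondary = DETERMINANTS[letters[1]] if len(letters) == 2 else None
    return primary, secondary


for score in ("F", "FC", "CF", "Y"):
    print(score, read_determinants(score))
```

Reading FC and CF through the same function makes the difference between them plain: the two scores name the same determinants and differ only in which one is primary.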
Form responses are thought to indicate a factual, rational approach
to life. Color responses are thought to indicate emotional reactions. If a

The interpretation of Rorschach responses is an extremely demanding
task. Although there are reasonably clear-cut rules for scoring
responses, there are only general standards and examples to direct the
final interpretation. The interpretation depends heavily on the subjective
impression of the examiner. It takes about two years of practice, usually
working in close collaboration with experienced examiners, to become
proficient at interpreting Rorschach responses.

Figure 15-3. One of the pictures used in the Thematic Apperception Test. (Reproduced
by permission of Harvard University Press.)


20 pictures is used, it usually requires more than one testing session to 
complete the series. In many cases the examiner uses only the half-dozen 
or so pictures which he thinks will elicit the most pertinent information 
from a particular subject. The subject is told to make up a story about 
each picture in turn. The instructions are approximately as follows: “I am 
going to show you some pictures. I want you to tell me a story about what 
is going on in each picture. What led up to it, and what will the outcome 
be?” The responses are either written down verbatim or a phonographic 
recording is made. 

Scoring and Interpretation. Murray originally scored TAT responses 
in terms of needs and presses. A need is something that the individual 
is trying to obtain. Some of the needs listed by Murray are achievement, 
aggression, nurturance, passivity, and sex. Presses are the outside in- 
fluences which help or hinder the attainment of needs, such as aggression, 
dominance, and rejection by other persons. Each story can be inspected 
for the needs and presses which are portrayed. 

Murray’s system of scoring also considered the hero in each story and 
the outcome. The hero is the person in the story with whom the subject 
apparently identifies. For example, responses to Figure 15-3 could concern
stories about either the older woman or the younger woman. Sometimes
the hero is a person outside of the picture. For example, a story
about Figure 15-3 could concern a relative of one of the persons in the 
picture. 


Although needs, presses, hero, and outcomes are considered by most 
of the people who work with the TAT, apparently few examiners make 
detailed scorings of them. In fact, no formal scoring system of any kind is
used by the majority of TAT examiners. The examiner interprets the
responses in terms of his knowledge of personality and his experience 
with the instrument. Some interpretations are of a common-sense kind 
with which most examiners would agree. If a male subject imputes 
unfriendly motives and actions to all of the female characters, it strongly 
suggests that he is having troubled relations with women in real-life 
situations. If all of the stories end in disappointment, embarrassment, and 
failure, it is likely that the subject feels defeated and depressed. If the 
stories are lacking in passion and violence, even in pictures where strong 
emotion is the evident theme, it indicates that the subject is suppressing
his own emotions. If any one type of social interaction, such as adultery,
is seen in numerous pictures and in pictures where there is little to suggest
it, this indicates that the subject is overly concerned about a particular
issue. Although there is no standard procedure of interpretation,
responses to the TAT pictures provide many hints about the subject's
concerns, his conception of himself, and the way he views his human
environment.

Special Uses for TAT Pictures. The original set of pictures developed 
by Murray and his coworkers is only one of many such sets of pictures 
that are presently in use. Pictures can be made for many different kinds 
of special studies. For example, a special set of pictures could be used to 
study the attitudes that management and labor hold toward one another. 
Pictures of persons dressed like foremen, workmen, union executives, and 
vice-presidents could be made. These persons could be posed in settings 
pertinent to the problems under study and structured to the extent that 
attitudes about certain kinds of social interactions would be elicited. Spe- 
cial sets of pictures have been composed for Negroes, Indians, and other 
racial and ethnic groups. Pictures of the TAT kind have been used to 
study anti-Semitism, family relations, attitudes toward military life, and 
many other topics.

Comparison of Rorschach and TAT. There is considerable competition 
between the Rorschach and the TAT as clinical instruments. Some psy- 
chologists prefer the Rorschach in general; other psychologists are equally 
biased toward the TAT. Most testers agree that there are some differences 
in the kinds of things which the two techniques measure, as well as some 
special advantages for each instrument with particular types of subjects. 
Although both instruments require an expert examiner, it is somewhat
more difficult to learn the administration and interpretation of the 
Rorschach. Although it is difficult for the subject to “fake” either of the 
techniques, it is probably more difficult to distort responses on the 
Rorschach. 

The Rorschach is not primarily concerned with content. That is, it is 
not very important whether a particular part of the blot is referred to as a
snake or a rope. The important thing is whether or not the percept is a 
sensible interpretation and whether the percept is determined by the 
form or color of the area. Content is the meat of the TAT. If a character 
is described as a detective rather than as a salesman, it will markedly 
affect the interpretation. 

The Rorschach is thought to be valuable in getting at disturbances in 
the thinking processes, particularly in patients who are nearing psychotic 
breakdowns. The TAT is more concerned with social adjustment than 
with thinking processes. Although there is controversy about the point, 
there are many people who believe that the Rorschach is better for 
detecting and classifying psychotic disorders, and that the TAT is better 
with neurotic disorders and milder disturbances. In many cases both tests 
are given to a subject, and many clinical psychologists believe this leads 
to a better all-around diagnosis than could be obtained from either of the 
instruments separately. 


OTHER PROJECTIVE TECHNIQUES 


In addition to the Rorschach and the TAT, there are numerous proce- 
dures for evaluating personality which are best thought of as projective 
techniques. Almost anything that can be described, completed, or inter- 
preted serves to some extent as a projective test. We tend to read our 
concerns and expectations into everything we do. 

Word Association. An old technique for learning about personality is to 
have an individual associate words. Various lists of words (see 7; 13, chap.
2) have been employed to get at particular kinds of reactions. The usual 
practice is to place certain emotionally tinged words among relatively 
neutral terms. The following list of words would be useful in studying the 
home and school adjustment of adolescents: 


 1. Hair ——          7. School ——
 2. Mother ——        8. Love ——
 3. Home ——          9. Tree ——
 4. Desk ——         10. Hate ——
 5. Book ——         11. Paper ——
 6. Father ——       12. Shoe ——
13. Fight ——        17. Body ——
14. String ——       18. Me ——
15. Sister ——       19. Brother ——
16. Cake ——         20. Friend ——

The words are read to the subject one at a time. He is asked to give the 
first word that comes into mind. The examiner records the responses and 
notes the time taken to respond. There are several ways in which the 
results are interpreted. The words which are associated often indicate the
subject’s attitudes toward persons and activities. In obvious cases where,
for example, the association to “father” is “spanking” and the association
to “hate” is “father,” there is the strong suggestion of a negative reaction
to the father. Equally revealing as the associated words are the emotional
reactions to the initial terms. If the subject takes a relatively long time to 
respond or gives signs of being embarrassed, a strong emotional reaction 
is suggested. Thus, even though the subject eventually supplies an innocu- 
ous association for “father,” such as “hat,” the long time taken to respond 
would suggest “blocking” and underlying conflict. A third type of information
used in the interpretation is the tendency to give unusual associations.
For example, most persons will associate “table” with “chair” or
respond with some other related item of furniture. It is unusual to find 
the response “tiger” to “chair,” and when numerous such associations are 
made, it might be related to mental illness. 
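
The use of response time as a sign of "blocking" can be sketched in a few lines. The Python below is an illustration under stated assumptions: the function flag_slow_responses, the rule of comparing each time with the subject's own median, and the factor of 2.0 are all hypothetical choices made for this example, and the trial data are invented. No such cutoff is given in the text.

```python
from statistics import median


def flag_slow_responses(trials, ratio=2.0):
    """Return stimulus words answered unusually slowly.

    Each trial is (stimulus, association, seconds). A response counts as
    slow when its time is at least `ratio` times the subject's own median
    time. Both the rule and the default ratio are illustrative assumptions.
    """
    typical = median(seconds for _, _, seconds in trials)
    return [stimulus for stimulus, _, seconds in trials
            if seconds >= ratio * typical]


# Invented data: an innocuous association to "father" given after a long delay.
trials = [
    ("hair", "brush", 1.2),
    ("mother", "home", 1.5),
    ("desk", "chair", 1.1),
    ("father", "hat", 4.8),
    ("tree", "leaf", 1.3),
]

print(flag_slow_responses(trials))
```

Comparing each time with the subject's own typical time, rather than with a fixed number of seconds, allows for the fact that some persons respond slowly to every word.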

Sentence Completion. A technique which is similar to word association 
is the use of incomplete sentences (see 14). Examples are as follows: 


I dislike most to 
I wish that I had never 
Most people are 
I become embarrassed 
The people I like most 




The usual procedure is to have subjects write their responses, permitting
the testing of a number of persons at one time. The sentences can
be structured to provide information on different areas of adjustment. No
effort is made to “time” the responses. Consequently, the subject is free
to make up whatever responses he chooses. The responses are analyzed
similarly to those on the TAT. That is, the moods, motives, and
expectations portrayed in the responses are applied in interpreting the
subject’s personality.

Play Techniques. A method which is especially useful with children is
to employ play materials as a projective device (see 3). Children
betray their feelings quite readily in play activities, and almost
any set of play materials serves as a “test.” Dolls and puppets are used
most often for this purpose. The situation can be structured by, for example,
naming the dolls “mother,” “father,” “little brother,” “me,” and so on.
Also the environment for the dolls can be structured to some extent by 
having present toy implements, such as a baby bottle, a toilet, a bed, doll 
clothing, and others. The child is encouraged to play with the material 
in whatever way he chooses. A revealing set of actions would be for the 
child to place the doll for “mother” and “little brother” with their faces 
to the wall while “father” gives the bottle to “me.” Play activity of this 
kind is usually very rich in suggestions about the feelings and concerns 
of children. 

Drawings, Paintings, and Sculpture. Artistic productions can be used as 
projective devices. These may be almost completely unstructured like 
finger painting (see 1, chap. 14), or they may be structured to the point 
of asking for the drawing of a man (see 11). The advantage of the rela- 
tively unstructured task is that it is often not perceived as a test. A widely 
used procedure is to furnish clay to children and let them make whatever 
they like. The product can be analyzed for the symbolism apparent or 
simply in terms of the actions imputed to the clay figures. If the child
makes a clay image of himself, it is important to note whether the bodily 
parts are in proper proportion. An outsized nose or excessively small arms 
might offer suggestions about the child’s concept of himself. Children often 
manifest strong emotions in artistic productions that they would not talk 
about openly. For example, a disturbed child might make a clay figure
of mother, then run over her with the toy car, tear off the arms, and 
finally throw her in the wastebasket. 

Finger painting is used with both adults and children. The subject is 
presented with a variety of paints and encouraged to use his fingers to 
make whatever he likes. The way that the subject approaches the task is 
thought to be as important as what is drawn. Some persons are very reti- 
cent about putting their fingers in the paint or making a mess with the 
materials. Other subjects take a childlike glee in smearing the paints and 
making the biggest mess possible. The colors which the individual chooses
are thought to be important. The proportioning, balance, and integration 
of the finished painting are also used in the interpretation. For example, 
it might be considered a sign of schizophrenia if the person makes only 
swirling smears with one color. 


EVALUATION OF PROJECTIVE TECHNIQUES 


Some adherents of projective techniques feel that their instruments lay 
open the depths of personality to observation. Rather sweeping generali- 
zations are often made about the response to an unstructured stimulus. 
Before the reader becomes alarmed about his own personality as it might 
be mirrored in the projective techniques, a look should be taken at the
factual basis for the instruments. 

Reliability. The ordinary measures of reliability are difficult to obtain
with projective tests. If, as on the Rorschach, component scores are
obtained prior to the interpretation, the reliability of these can be studied 
by split-half, equivalent-form, and other techniques. The reliabilities of 
single components on the Rorschach, like number of W responses and 
number of color responses, are less than those obtained with most tests of 
human ability but not so low as to render the indices unusable (see 4). 
There are two reasons why the reliability of projective techniques cannot 
be determined from component scores. First, most of the projective 
devices employ few, if any, scores for separate responses. Second, even 
if separate responses are scored, the interpretation is the final test result,
not the initial scores. Consequently, it is necessary to test the reliability of
interpretations from the test. 

Before studies of the reliability of interpretations can be made, it is 
necessary to develop objective procedures for recording interpretation. 
One way to do this is to have the Rorschach examiner make ratings of the 
subject on various personality traits (a handy procedure for doing this 
called the Q-sort will be discussed in Chapter 17). Correlations can then 
be made between the interpretations given by one examiner on two occa- 
sions and between the interpretations given by different examiners. Very 
few studies of this kind have been done. 
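
The kind of correlation referred to above can be shown with a small worked computation. The Python sketch below is an illustration only: the function pearson_r is written out by hand for clarity, and the two sets of ratings are invented, not taken from any study. It assumes that each examiner has rated the same subjects on one trait using a numerical scale.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("ratings with no variation cannot be correlated")
    return sxy / sqrt(sxx * syy)


# Invented ratings of six subjects on one trait by two examiners.
examiner_a = [7, 5, 3, 6, 2, 4]
examiner_b = [6, 5, 4, 7, 1, 3]

print(round(pearson_r(examiner_a, examiner_b), 2))
```

The same function serves for both comparisons mentioned in the text: ratings by two different examiners, or ratings by one examiner on two occasions.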

Beck¹ found correlations between the interpretations by two examiners
on numerous cases to range from above .80 down to zero. One examiner 
repeated the ratings of 10 cases a year after the first ratings were made. 
He did not remember making the earlier ratings. The correlations between 
the two interpretations ranged from .50 to .89, with a mean of .70. Although
these are only a drop in the bucket compared with the studies which are
needed, they indicate that the reliability fluctuates considerably for differ- 
ent kinds of subjects. Thus, whereas a child’s responses might be inter- 
preted with low reliability, an adult’s responses might be interpreted with 
high reliability; and the reliability of interpretations made about mentally 
ill persons might be higher or lower than interpretations made about
normal persons. The problem is further complicated in that the
cases with which reliable interpretations are obtained from the Rorschach
may be cases for which only unreliable interpretations are obtained from
the TAT or from other projective instruments.

¹ Personal communication from S. J. Beck.

Whereas it is expected and usually necessary that scores on most tests
remain stable over moderate periods of time, it is not necessarily the case
with the results of projective techniques. If the scores made by a person on
an intelligence test fluctuate markedly over a period of six months or
even several years, the use of the instrument would be seriously impaired.
However, it is to be expected that some of the personality attributes 
mirrored in the projective techniques should sometimes change sub- 
stantially in relatively short periods of time. For example, the Rorschach 
or TAT responses of an individual would likely change from the beginning 
to the end of a successful psychotherapy. Instruments like the TAT are 
probably affected to some extent by day-to-day changes in moods and by 
good and bad turns of events. Consequently, it is difficult to untangle the
expected changes in responses from the measurement error inherent in the 
testing procedures. 

Much more research needs to be done on the reliability of interpreta- 
tions derived from projective techniques. Equally important to learning 
the over-all reliability with which interpretations are made is the need to 
learn the kinds of traits which are more reliably rated and those which 
are less reliably rated. For example, it might be found for one projective 
technique that interpretations regarding moods are reliable (high agree- 
ment among examiners) and that interpretations regarding future actions 
are relatively unreliable. Studies of this kind would help define the kinds 
of traits which are most reliably interpreted from different instruments. 

It is doubtful that any one test can make the sweeping observations 
about personalities which are often claimed. Different techniques prob- 
ably have different strong and weak points as personality measures. 
Because an instrument is shown to be reliable, it does not necessarily 
mean that it is valid for any purpose; but if it can be shown that interpre- 
tations regarding certain kinds of traits are unreliable, it means that they 
cannot be valid in any sense. Systematic studies of the reliability with 
which different projective techniques lead to interpretations of various 
personality characteristics would help define techniques in terms of what 
they measure, at least reliably, and would narrow the field of investiga- 
tion for subsequent studies of validity. 

Validity. Projective techniques must be classified as predictors by 
default. It is difficult to argue that the projective techniques are assess- 
ments of personality. For example, if an individual calls an ink blot a 
butterfly, there is no reason to believe that this response represents any- 
thing about his personality unless evidence is provided to prove that such 
is the case. Consequently, the validity of projective techniques can be
determined only by correlating interpretations with important behaviors
outside the testing situation.

In comparison to the many applications of projective techniques, there 
are very few studies of predictive validity. Consequently, no firm statements
can be made concerning the validity of projective techniques as a
group or about the validity of particular instruments. One of the problems 
is finding assessments for the projective testers to predict. The contrasted-groups
study has been used most often to validate projective techniques.
In a typical study an examiner is given the Rorschach records of both 
normal persons and individuals who are diagnosed as schizophrenic. The 
examiner must decide from the test results alone whether each person is 
from the normal or the schizophrenic group. Studies of this kind indicate 
that the Rorschach and the TAT are moderately successful in differentiat- 
ing normal persons from the mentally ill. However, it should be pointed 
out that this is a very weak test of validity. The instruments are often 
used to make fine distinctions between people in the normal range and to 
differentiate among types of mental illness. 
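
The result of a contrasted-groups study of this kind reduces to a simple proportion of correct judgments. The Python sketch below is a hypothetical illustration: the function hit_rate and the ten cases are invented for the example and do not represent findings from any study.

```python
def hit_rate(actual, judged):
    """Proportion of cases in which the examiner's judgment matches the true group."""
    if len(actual) != len(judged) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(a == j for a, j in zip(actual, judged))
    return hits / len(actual)


# Invented example: five records from each group, judged from test results alone.
actual = ["normal"] * 5 + ["schizophrenic"] * 5
judged = ["normal", "normal", "normal", "schizophrenic", "normal",
          "schizophrenic", "schizophrenic", "normal", "schizophrenic",
          "schizophrenic"]

print(f"{hit_rate(actual, judged):.0%} of the records were correctly sorted")
```

With two groups of equal size, guessing alone would be expected to sort about half of the records correctly, so a hit rate must be judged against that baseline rather than against zero.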

One of the difficulties in validating and improving the projective techniques
is an unwillingness on the part of some devotees to subject their
instruments to empirical investigation. A kind of cultism which encourages
faith in the instruments rather than a healthy scientific skepticism has
arisen among some projective testers. They would like other people to
accept their projective devices and the elaborate interpretations which
they make as self-evidently valid. It is encouraging to see that many
exponents of projective techniques are aware of the need for empirical
investigations and are busily performing the necessary research.

Standardization. A test was defined earlier as a standardized situation
which provides the individual with a score. Do the projective techniques
meet the requirements? In comparison to most tests, the projective techniques
are relatively unstandardized. Although efforts are made to
standardize the presentation of material to the subject, there are inevitable 
differences in the approaches used by different examiners. Much appar- 
ently depends on the way the examiner acts and the kind of person he is. 
With an instrument like the Rorschach, some examiners typically obtain 
more responses than other examiners, and women examiners sometimes 
obtain responses different from those obtained by male examiners.

The final results of projective techniques, the descriptions of individual
personalities, are highly dependent on the intuitive judgment of the
examiner. Not only are examiners unable to catalogue all of the rules
which they use in reaching interpretations, but they are probably not
aware of many cues which they employ. The examiner is not a person
who simply administers the test and as such plays a minor role in the
result. The examiner is part of the projective technique and inseparable
from the test materials. Some examiners are undoubtedly more effective
than others in deriving personality descriptions. Consequently, the validity
of the technique is interwoven with the ability of the examiner who
uses it. As long as projective techniques depend so heavily on the interpretations
of examiners, they will not be tests in the stricter sense. Projective
testing is more of an art than a science, but this does not necessarily
mean that the results are invalid. It does, however, force us to consider
the validity of separate examiners, and that immensely complicates the
development of projective techniques for general use. 

Some efforts have been made to standardize projective techniques more
fully, particularly the Rorschach. Several multiple choice Rorschach
forms have been developed. A list of responses is presented with each ink
blot. The respondent chooses the alternative response which most nearly
matches what he sees in the ink blot. The alternatives for each blot were
chosen from the records of both normal and abnormal subjects. Although
the earlier forms provided only an over-all adjustment score, a more
recent version (6) provides scores for different aspects of personality like
those considered in the usual method of applying the instrument. Most
Rorschach examiners prefer the traditional procedure to the multiple
choice approach. There is not enough evidence to say how well the
multiple choice instruments work in practice, but any systematic efforts
to obtain more standardized procedures should be encouraged.


Special Advantages. One of the foremost advantages of projective techniques
is that most of them are difficult to fake. The subject is usually
unaware of how his responses will be interpreted. The person who tries
to distort responses often gives himself away. It was mentioned earlier in
connection with the word-association technique that an effort to cover up
unpleasant feelings can usually be detected by the relatively long time
taken to respond. Certain areas on Rorschach ink blots usually generate
sexual associations. If a person avoids sexual responses, it might indicate
that he is concerned about sex. There are many other ways in which the
experienced examiner can detect what the subject tries to hide. Even
professional testers find it difficult to distort their own responses to projective
techniques.

An advantage which is shared by most of the projective techniques is 
that they can be administered to persons of all ages, ethnic groups, and 
intelligence levels. The instructions are very simple and, in most cases, 
neither reading nor writing is required. The projective techniques are 
particularly applicable to children. Children who are unable or unwilling 
to discuss their problems directly usually react to the projective techniques
as though they were games.

Clinical Usefulness. The projective techniques are used primarily in 
clinical practice. They are not likely to receive wide use in school programs
or in the selection of people for particular jobs. The examiner must
be an expert, and there are not enough of these experts to fill the needs 
of large testing programs. It takes the better part of a day to administer 
and interpret responses on the Rorschach and the TAT; thus, it would be 
excessively expensive to test large numbers of subjects in this way. 

Projective techniques are used in making decisions about people who 
seek psychiatric treatment and about mental hospital patients. Questions
must be answered about the nature and severity of the patients’ problems
and the kinds of treatment which are likely to do the most good. These
are very important questions concerning human lives, and all available
sources of information should be used. Decisions about patients must
often be based on no more than one or several interviews plus some background
information. To the extent that the projective techniques actually
help in making these decisions, they perform a very worthwhile service.


Summary Evaluation. The projective techniques are ingenious efforts
to measure personality variables. Many of the interpretations that arise
from them appeal to common sense and also fit in with psychological
theory. Unfortunately, the techniques are relatively unstandardized, and it
is difficult to determine how well they work. Some projective testers have
made unwisely sweeping claims for particular instruments. The techniques
are often said to measure the “whole personality” and the “total
behavior pattern.” It is doubtful that activities so circumscribed as responding
to 10 ink blots or making up stories about particular pictures
will lead to such broad conclusions about the complexities of human
personalities. The indicated directions for future research are to standardize
the projective techniques and to determine the kinds of personality
attributes which each measures most effectively.


CHAPTER 16


The Future of Personality Measurement 


In the previous two chapters, we discussed the instruments which are
currently being used most widely as measures of personality: self-description
inventories and projective techniques. It was concluded that
neither of these approaches to personality measurement has yet provided
instruments of known validity. Whereas we have reasonably good measures
of interests, attitudes, and opinions (which some authors would
classify as personality variables), few, if any, proved techniques are available
to measure social traits, degree of adjustment, and personality
“dynamics.” This leaves a great gap in the storehouse of measurement
tools in an area where measurement methods are most needed. Personality
traits are, of course, very much involved in success and happiness
in vocational and social life. Consequently, there is an urgent need to
develop adequate measures of personality traits. This chapter will discuss
some of the directions in which the research in personality measurement
seems likely to move.

Instead of looking for new approaches to personality measurement, it 
is possible that self-description inventories and projective techniques can 
be refined to the point of providing adequate instruments. It was said in 
Chapter 14 that the results of self-description inventories are probably 
most valid when they are administered in research studies and not used 
to evaluate people or make decisions about them in any way. Self- 
description inventories may prove useful in the search for more valid 
personality measures. For example, it may be found that a physiological
attribute correlates with the results of a self-description inventory, say one
that measures “dominance,” when the instruments are administered in a
research setting. Whereas the self-description inventory would likely not
be a valid measure in nonresearch settings, like personnel-selection programs,
because of the need and ability of individuals to “fake” responses,
the physiological variable might not be controllable by subjects.

In spite of the cleverness and good sense involved in some of the
projective instruments, the administration, scoring, and particularly the
interpretation are relatively unstandardized. They may provide valid
results in particular settings, used by particular examiners, and in respect
to particular traits; however, it is not known when and how they work
best and which examiners are more proficient at interpreting the results
of projective instruments. Consequently, projective techniques cannot be
used for general-purpose testing until they are more fully standardized
and their validity tested for particular uses.

Unfortunately, there seems to be only a moderate amount of research
interest in the standardization of projective techniques. One attempt is
the multiple choice Rorschach which was mentioned in Chapter 15. Many
people feel that multiple choice forms and other standardized scoring
procedures measure only the simpler human attributes and therefore lose
much of the “richness” inherent in the free-response projective testing
procedure. This criticism is probably justified in that it is much easier to
invent objective tests for simple rather than complex attributes. But part
of the criticism is probably due to a resistance on the part of some clinical
psychologists to objective tests as such. The clinician prizes his intuitive
ability and his function as an observer and interpreter of human behavior.
If these functions are taken over in part by standardized tests, the clinician
quite naturally feels that his particular usefulness is somewhat diminished.
However, the more important problem of using measurement techniques
to help unhappy people should receive precedence over professional likes
and dislikes. Consequently, every effort should be made to refine and
standardize the projective techniques.


OBSERVATION OF BEHAVIOR 


Rather than infer personality characteristics from paper-and-pencil or 
projective tests, another approach is to observe people as they actually 
behave. As a simple example, a disturbed child can be observed as he 
plays with a group of children. If the observations can be reliably 
recorded and scored, they can be used as personality measures. The 
advantage of observational testing is that it has a real-life quality not 
shared by conventional testing instruments.

Many everyday decisions about people are, of course, reached by a 
form of observational analysis. The football coach observes the freshman 
quarterback to see how well he passes, kicks, and runs with the ball. The 
new bank teller is judged in terms of his promptness, accuracy in main- 
taining accounts, and courteousness to customers. The new cook is judged 
by her cakes and pies and by the neatness of the kitchen. The following 
sections will discuss some of the ways in which behavioral observation 
can be used to measure personality traits.

Ratings. Rating scales are used very widely as a means to record behavioral
observations. Many industrial settings employ a standard rating form
which foremen use to record their observations of individual employees.
Office managers use rating scales to judge the promptness, neatness, and
job proficiency of clerical workers. Candidates for graduate school training
are rated by their professors in terms of general educational background,
intelligence, initiative, emotional stability, writing skill, and other
traits. Rating scales are used widely in military life to judge the suitability
of men for promotions or for specialized training. A typical set of rating
scales is shown in Figure 16-1.

Figure 16-1. A typical set of rating scales.

Ratings are often said to be more objective than self-description inventories.
It is certainly true that other people are less sensitive about recording
an individual's shortcomings than the individual himself would be.
However, there are numerous pitfalls that beset ratings, and only when
these have been guarded against will ratings provide a valid picture of
individual personalities.

One of the most common sources of error in ratings is lack of information
about subjects. Ratings usually will not be valid if raters see the subject
only on several occasions or if they see the subject only in restricted
environments. Army officers are sometimes called on to rate as many as
100 men on the basis of two or three months’ performance.
College professors often have more than 60 students, and
first-hand acquaintance with any one of them is likely to be slight. In
these and many other circumstances, ratings are sometimes made with
only a small amount of information about the subjects.
Typically, in situations of this kind, the rater is aware mainly of the
extremes. The very good and the very poor men catch the rater’s eye, and
he is able to make valid ratings about them. The majority of the people,
however, do nothing either so meritorious or so wrong as to make their
presence known, and therefore, only very unreliable ratings can be made
about them.

In some situations the rater has considerable experience with subjects
in a particular setting but little or no acquaintance with them in other
settings. For example, a college professor may have a good opportunity
to observe diligence in the classroom but no opportunity to judge social
behavior outside the classroom. The first step in arriving at valid ratings
is to make sure that raters have an extensive acquaintance with subjects
and that they have witnessed behavior in situations relevant to the
traits being rated.


In addition to lack of information about subjects, a number of forms
of bias often enter rating scales. One of these is personal bias toward
particular individuals. It is difficult not to give better ratings to our
friends than to people whom we do not know or do not like. Another
source of error is a constant bias on the part of the individual rater to
rate all persons generally high or generally low. If a number of raters are
asked to rate 30 men, they will differ to some extent not only in their
orderings of the men but also in their mean ratings. Some raters have a
positive bias, tending to rate all persons near average or above. Other
raters have a negative bias, giving a preponderance of low ratings.
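
The constant bias described above can be made visible by comparing each rater's mean rating with the mean of all ratings. The following is a minimal sketch in Python under stated assumptions: the three raters, the five men, and the 1-to-5 scale are hypothetical illustrations and do not come from any study cited in this chapter.

```python
from statistics import mean

# Hypothetical ratings on a 1-to-5 scale; each rater judged the same five men.
ratings = {
    "rater_a": [4, 5, 4, 5, 4],
    "rater_b": [2, 3, 2, 3, 1],
    "rater_c": [3, 4, 3, 4, 2],
}


def rater_means(ratings):
    """Return the mean rating that each rater gave across all subjects."""
    return {rater: mean(scores) for rater, scores in ratings.items()}


def rater_bias(ratings):
    """Return each rater's mean minus the grand mean of all ratings.

    A positive value suggests a lenient (positive) bias, a negative value
    a severe (negative) bias, relative to this particular group of raters.
    """
    means = rater_means(ratings)
    grand_mean = mean(score for scores in ratings.values() for score in scores)
    return {rater: m - grand_mean for rater, m in means.items()}


bias = rater_bias(ratings)
for rater, value in bias.items():
    print(f"{rater}: {value:+.2f}")
```

Note that the three hypothetical raters order the men in much the same way; they differ chiefly in level, which is the kind of constant bias that averaging over raters helps to cancel.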

A form of bias called the “halo effect” consists of giving all bad, all
average, or all good ratings to people. Rather than thinking differentially
about the strong and weak points in persons, we tend to develop general
attitudes toward individuals. Consequently, the individual rater is prone
to rate a person in much the same way on different rating scales even
when that is not the true picture. For example, the soldier who is
good in drill and maintenance of equipment is not necessarily good in
marksmanship.

An important point to note about ratings is that they are often unreli- 
able. The reliability of ratings can be studied by all of the methods dis- 
cussed in Chapter 6 in the same way that the reliability of mental tests is 
determined. The preferred procedure is to have the same individuals 
rated by two raters either on the same or an equivalent rating form. The 
two sets of ratings can then be correlated to obtain an estimate of the 
reliability coefficient. The coefficients obtained for ratings in this way are
seldom as high as .80, often run below .50, and go all the way down to
zero in some cases.
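
The two-rater procedure just described amounts to computing a product-moment correlation between two columns of ratings. The sketch below shows the arithmetic in Python; the ratings for six individuals are hypothetical, and the function name `pearson_r` is chosen here only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 ratings")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of the same six individuals by two raters (1-to-5 scale).
rater_1 = [3, 5, 2, 4, 1, 4]
rater_2 = [2, 4, 3, 5, 1, 3]

reliability_estimate = pearson_r(rater_1, rater_2)
print(f"estimated reliability of a single rater: {reliability_estimate:.2f}")
```

With so few individuals the estimate would be very unstable in practice; the example is meant only to show how the coefficient is obtained.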

A number of things can be done to improve ratings. One, which was
mentioned previously, is to provide more and better opportunities to
observe individuals. Secondly, raters should be trained for their jobs, told
about the various forms of bias in ratings, and given extensive practice in
observing and rating individuals. Thirdly, a substantially more accurate
picture of the subject can usually be obtained by averaging the ratings of
two or more raters. This serves to iron out the biases of individual raters
and results in a more reliable measure. Using multiple raters improves the
reliability in much the same way that increasing the number of items
improves the reliability of tests. Nunnally* recently determined the reliability
with which four psychiatrists rated the symptoms of 21 cases. The
median reliability for individual psychiatrists was less than .40. When
ratings were averaged over the four psychiatrists, the median reliability
rose above .80.

Some personal skills, like military bearing, “bedside manner” of physicians,
and leadership qualities, are probably so complex that they can
best be measured by the equally complex judgmental processes of human 
observers. However, we often make decisions about people on the rather 
unreliable, often biased, judgment of only one rater. One rater is seldom 
sufficient. Two or three raters provide a more reliable measurement. The 
gain in precision from using more than four raters is seldom worth the 
effort. 
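
The diminishing return from additional raters can be illustrated with the Spearman-Brown formula, the standard relation for the effect of lengthening a test, which is not named in the passage above and is applied here on the assumption that raters behave like parallel test items. The single-rater reliability of .40 is a hypothetical starting value, and the sketch is not an attempt to reproduce the psychiatrists' figures quoted earlier.

```python
def spearman_brown(r_single, k):
    """Reliability of the average of k raters, given single-rater reliability.

    Assumes the raters are interchangeable ("parallel") in the sense of
    classical test theory, which real raters only approximate.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1 + (k - 1) * r_single)


# Hypothetical single-rater reliability of .40, averaged over 1 to 8 raters.
r_single = 0.40
projected = {k: spearman_brown(r_single, k) for k in range(1, 9)}

for k, r_k in projected.items():
    gain = r_k - projected[k - 1] if k > 1 else 0.0
    print(f"{k} rater(s): reliability {r_k:.2f} (gain {gain:+.2f})")
```

Under these assumptions the gain from each added rater shrinks steadily, which is consistent with the remark that more than four raters are seldom worth the effort.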

Instead of having ratings made by experts or by administrative officials, 
another approach is to have the workers or trainees rate one another. This 
is usually referred to as the “peer rating” technique. In a typical use of 
peer ratings, officer candidates in the Armed Forces are required to rate 
the other men in their unit in terms of their abilities and leadership quali- 
ties. The advantage of the peer rating technique is that the men usually 
have a large amount of first-hand experience with one another. Validity 
studies indicate some promise for peer ratings. 

Situational Tests. In addition to standardizing the rating of behavior 
observations, some efforts have been made to standardize the situations 
which are observed. One of the pioneer efforts of this kind was the screening
program developed by the Office of Strategic Services during the
Second World War (12). The purpose was to select men for military
intelligence work, espionage, and other dangerous assignments. In addition
to tests of intelligence, memory, and mechanical ability, each candidate
was given a series of “situational tests.” Each situational test involved
a type of problem that might be encountered in actual duty. The candidate’s
performance was rated in terms of ability to think quickly and
effectively, emotional stability, and leadership.

In one of the situational tests, the candidate was given the problem of
constructing a 5-foot cube from wooden blocks, poles, and pegs. The
candidate was told that, since it was impossible for one person to complete
the task in the ten minutes allotted, he would be supplied with two
helpers. Actually, the two “helpers” were psychologists whose purpose
was not to help but to hinder in all possible ways. One “helper” acted lazy
and confused; the other bickered, argued, and criticized. The “helpers”
were so effective that no one ever completed the task. The real purpose
was, of course, to observe the candidate’s behavior in a tense, frustrating
situation.

* Unpublished results found by Nunnally while working as consultant to Dr. Roy
Grinker on a study of depressed patients.


In another OSS situational test, the candidate was told to imagine that 
he is caught going through files marked “secret” in a government office, 
that he does not work in the particular building, and that he carries no 
identification papers. The candidate was given twelve minutes to con- 
struct an alibi for his presence in the suspicious circumstances. Then he 
was subjected to a harrowing interrogation, in which attempts were made 
to break his alibi and to make his statements appear foolish. The candi- 
date was rated on the convincingness of his story and his ability to support 
it in the interrogation. 

Because the OSS measurement program operated under considerable
time pressure and because it had to fulfill the immediate task of evaluating
men, some of the procedures were relatively crude. There was little
opportunity to measure the effectiveness of situational tests as predictors
of later performance. One of the difficulties was finding adequate criteria
of performance against which to validate the tests. The major criteria
were ratings by field commanders and by fellow workers. These ratings
proved to be of low reliability, and consequently, correlations with situational
test scores were quite restricted. The effectiveness of situational
tests in the OSS program is still a matter of controversy.
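
The way an unreliable criterion restricts correlations can be shown with the classical attenuation relation from test theory, in which the observed correlation equals the correlation between true scores multiplied by the square root of the product of the two reliabilities. The relation is not stated in the passage above, and every number in the sketch below is hypothetical; none comes from the OSS data.

```python
from math import sqrt


def attenuated_correlation(true_r, rel_test, rel_criterion):
    """Observed correlation expected when both measures contain random error."""
    return true_r * sqrt(rel_test * rel_criterion)


def ceiling_on_validity(rel_test, rel_criterion):
    """Largest observed correlation possible, reached when the true r is 1.0."""
    return sqrt(rel_test * rel_criterion)


# Hypothetical values: a fairly reliable test and an unreliable rating criterion.
rel_test = 0.80
rel_criterion = 0.30
true_r = 0.60

observed = attenuated_correlation(true_r, rel_test, rel_criterion)
ceiling = ceiling_on_validity(rel_test, rel_criterion)
print(f"expected observed correlation: {observed:.2f}")
print(f"ceiling on any observed correlation: {ceiling:.2f}")
```

On these assumed figures even a perfectly valid test could not correlate much above .49 with the criterion, which illustrates why low criterion reliability makes validation studies hard to interpret.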

The OSS selection program received wide attention, and since the 
Second World War, a number of attempts have been made to employ situ- 
ational tests. One of the best known of these was a five-year project whose 
purpose was to develop instruments for selecting student trainees in 
clinical psychology (9). Potential trainees in Veterans’ Administration 
hospitals came from 30 universities scattered across the country. They 
were housed at the University of Michigan during the testing period. In 
addition to tests of ability and personality, a number of situational tests 
were used.

The situational tests used in the Michigan study were modeled directly 
after those employed in the OSS program. One consisted of a discussion 
situation in which the candidates were told to act as a citizens’ committee 
and discuss the use of a large financial bequest. A second situational test 
involved improvising responses in a social situation. A third consisted of 
a group problem-solving task involving the moving of a set of heavy 
blocks. The fourth test required the individual candidate to demonstrate
the emotions expressed in poetry and in particular words, with gestures
and facial expressions only.

Whereas there was little opportunity to study the validity of situational
tests used by the OSS, an extensive study of validity was made in the
Michigan project. Criterion ratings were obtained from university and
clinical supervisors some three years after the original testing. Trainees
were rated in terms of clinical competence and preference for hiring. The
ratings made earlier in regard to performance in situational tests correlated
.35 with clinical competence and .20 with preference for hiring.
However, standard printed tests worked as well. Situational test results 
added nothing to the predictive efficiency which was obtained from 
objective test results and personal history information. 

There are numerous practical drawbacks to the use of situational tests. 
They often require elaborate equipment, much work space, and many 
persons to administer the tests. It is often necessary to use trained actors 
in the tests and it is difficult to standardize their performance. Much 
pretesting must be done on the test design, and raters must practice their
jobs. Only one or only a few men can be tested at once, and it is time
consuming to place large numbers of men through the tests. It is difficult 
to hide the nature of the test situations from new subjects. The persons 
previously tested often pass on information to people who will later take 
the tests. If the situations are not changed often, with the result of much 
additional work, or if elaborate precautions are not made to keep secret 
the nature of the situations to be used, new subjects will “rehearse” their 
behavior prior to the tests. 

In general, situational tests are cumbersome, expensive, and difficult 
to standardize. Only if they produce a marked gain in predictive validity 
over more conventional measurement techniques will they be worth the 
effort. The situational tests employed so far have not lived up to that 
standard. The evidence so far indicates that ratings made in respect to 
a longer-term general acquaintance with a person are more valid than 
ratings made about behavior in contrived situations.

Behavioral Tests. Another method of behavioral observation is to col- 
lect some objective product of activity in lifelike situations. The test con- 
cerns what an individual actually does rather than ratings of his behavior. 
One of the earliest and still the best-known use of behavioral tests was
that of Hartshorne and May (8) of the Character Education Inquiry.
They wanted to measure traits in school children, such as honesty, truthfulness,
cooperativeness, and self-control. Rather than use conventional
tests or ratings to measure these characteristics, they chose to observe
the actual behavior of children in respect to the traits. The observations
were conducted in the normal routine of school activities, in athletics,
recreation, and classroom work.

Observations were made in respect to each trait in such a way as to
provide an objective score. For example, one of the measures of cheating
was obtained by allowing students to grade their own papers and noting
the number of alterations of answers. Tests like vocabulary and arithmetic
reasoning were administered in the classroom. The tests were collected
and a duplicate copy made of each. The original unscored papers were
returned to students along with a list of the correct answers. Children
scored their own papers and either gave themselves the correct grade or
altered answers to improve their standings. Scores reported by students
were then compared with the scores students actually attained, as measured
by the duplicate test copies. The amount of discrepancy between
the two scores provided a measure of cheating.
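
The scoring logic of the cheating measure can be sketched in a few lines of Python. The answer key, the six-item test, and the function names are hypothetical, and the sketch assumes that the score a child reports is simply the score earned by the paper as he handed it back.

```python
def score(answers, key):
    """Number of answers that match the answer key."""
    return sum(1 for given, correct in zip(answers, key) if given == correct)


def cheating_measures(duplicate, returned, key):
    """Compare the secretly duplicated paper with the self-graded paper."""
    actual = score(duplicate, key)      # score actually attained
    reported = score(returned, key)     # score after any alterations
    alterations = sum(
        1 for before, after in zip(duplicate, returned) if before != after
    )
    return {
        "actual": actual,
        "reported": reported,
        "discrepancy": reported - actual,
        "alterations": alterations,
    }


# Hypothetical six-item multiple choice test.
key = ["b", "d", "a", "c", "a", "b"]
duplicate = ["b", "a", "a", "d", "c", "b"]   # copy made before papers were returned
returned = ["b", "d", "a", "c", "c", "b"]    # paper as handed back by the child

result = cheating_measures(duplicate, returned, key)
print(result)
```

Both the discrepancy between scores and the count of altered answers are returned, since the passage mentions each as an index of cheating.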

Another of the Hartshorne and May tests concerned the trait of charity.
Each child was given an attractive kit, including 10 articles such as pencil
sharpener, eraser, and ruler. After the children had examined the materials
for some time, they were allowed to give away some or all of the
items to “less fortunate children.” The children were not coerced to
donate, and the donations were made anonymously. Each child was provided
with a large envelope in which to put his donation, and the donations
were dropped into a common box. Unknown to the children,
envelopes had been marked in such a way as to identify the donations
of each child. The number of articles donated served as a measure of
charity.

When behavioral tests of the kind originated by Hartshorne and May 
can be used, they have a number of attractive advantages. The use of 
actual behavioral products frees the measurement procedure from the 
subjectivity of rating scales. If observations can be made in natural situ- 
ations where the subject is unaware that he is being tested in any sense, 
the results are probably more valid. One of the faults of situational tests 
is that they are more like games than real-life situations, and subjects 
often regard them as such. 

Other than the Hartshorne and May studies, there have been few 
attempts to develop systematic behavioral tests, although much the same 
thing is done informally in many evaluational efforts. The major uses of 
behavioral tests have been with children. This is because children are 
easier to find in group situations or easier to place in group situations 
without having them suspect an ulterior purpose. Also, the behavioral
products of children are usually simpler and more easily measured than 
complex adult interactions. Behavioral tests are sometimes used with 
kindergarten and nursery school groups. Behavioral records can be kept 
either surreptitiously by an examiner, or the children can be watched 
through a one-way vision screen. Notes are made of the number of times 
a child offers a toy to others, the number of times he asks for adult help, 
and other relevant behavior. In order to get at more complex responses, 
it is sometimes necessary to use both direct recording of actions and ratings
of behavior to measure such traits as “responsiveness” and “tendency
to withdraw.”

An adaptation of the behavioral type of test is presently being used to
measure the social responsiveness of mental patients. The patient is
placed in a standard interview situation and records are made of actual
behavior. One such routine is referred to as the Minimal Social Behavior
Scale (5). A typical “item” consists of offering the patient a pencil and
noting whether or not it is accepted. In another “item” the examiner
places a cigarette in his mouth and fumbles for a match. He stands up
and looks through all of his pockets. A book of matches sits in plain view 
of the patient. The patient is scored “plus” if he mentions the matches on 
the desk or offers a match from his own pocket. A series of such behavioral 
observations provides a picture of the patient’s social responsiveness. 


MAXIMUM PERFORMANCE TESTS 


It has always been hoped that one day we should have tests, in the 
stricter sense of the term, to measure personality characteristics instead 
of having to use instruments like self-description inventories, which are 
controllable by subjects. When an instrument concerns how much an
individual knows, how many problems he can solve, or how well he per- 
forms, we tend to think of it as a measure of some intellectual character- 
istic as distinct from personality characteristics. However, it is entirely 
possible that personality characteristics are involved in what appear to 
be “ability,” “cognitive,” or “maximum performance” type tests.

Perceptual Tests. It has always been thought that perceptual tests, like 
those mentioned in Chapter 9, measure not only certain ability factors but 
relate to personality characteristics as well. One of the earliest attempts to 
test this hypothesis was Thurstone’s (10, pp. 137-141) study of govern- 
ment administrators. He administered a wide variety of perceptual and 
other tests to both experienced government administrators and to rela- 
tively inexperienced officials. He categorized the subjects in each age 
group according to their salaries, assuming that salary related at least 
roughly to administrative competence and perhaps this in turn to leadership
ability. He found that among other tests which served to differentiate
the salary groups, one of the perceptual tests appeared to produce a significant
differentiation. The particular test, the Gottschaldt Figures Test,
is concerned with the ability to detect visual patterns imbedded in
complex figures. The finding is now mainly of historical value, but it
serves to illustrate the interest in relating perceptual functions to personality
variables.

One of the most interesting studies relating perceptual functions to
personality characteristics is the work of Eysenck (see 1) on “dark
vision.” Dark vision is, quite simply, the ability to see well at night, in
contrast to the usual criterion of good vision, the ability to see well in the
daytime or in lighted rooms. It has been known for some time that dark
vision is not necessarily related to day vision. Persons who can see quite
well in the daytime often have great difficulty in seeing well at night,
and vice versa. Eysenck compared the dark-vision ability of 96 neurotic
patients with 6,000 apparently non-neurotic members of the British RAF. 
On a scale ranging from 0 to 82, the neurotics had a mean score of TI 
and the normals had a mean score of 19.3. Additional investigations 
showed that the dark-vision test tends to discriminate among neurotics 
in terms of the severity of their disorders. 

Another perceptual measure which is currently receiving some atten- 
tion as a possible test of personality characteristics is the color-form domi- 
nance test. The test consists of presenting on film visual images of a 
composite nature so that the subject can report a perception of movement
which is determined either by colors or by form (or shape) only. It is
found that some people characteristically see the color-determined percept
and others characteristically see the form-determined percept. The
evidence so far is insufficient to say whether or not color-form dominance 
relates to particular personality characteristics. 

Persistence Tests. Tests of the maximum performance type have been 
used to measure how long individuals will work at difficult or even 
impossible problems. These supposedly measure a trait of persistence, 
which in everyday language means the tendency to stick with problems 
when the going is tough. In a typical test, the subjects are presented with 
a very difficult mechanical puzzle. Some subjects will give up after a few 
failures; other subjects will work on in spite of numerous failures. Similar 
tests can be constructed for perceptual, verbal, motor, and numerous 
other ability functions. Studies to date (see 2, pp. 284-290) indicate that 
there is a general trait of persistence underlying many of the tests. The 
trait correlates to some extent with intelligence test scores but correlates 
more highly with school grades; this indicates that an important per- 
sonality attribute is at least being touched on in persistence tests. 

Response Sets. It is possible to differentiate people not only in terms 
of the test scores which they make but in terms of the way in which they 
take tests as well. Regardless of what the test is about, people bring with 
them certain test-taking habits, or “response sets,” which, in part, deter- 
mine their scores. One of the oldest observations along these lines is that 
personality, or as it was called in older days, “temperament,” is involved 
in psychophysical studies of judgment. The subject can be asked either to 
judge whether stimulus A is greater or less than B or he can be provided 
with a third choice, an “indifferent” category, in which the judgment is 
“neither less nor greater.” It was noted early that some subjects use the 
“indifferent category” much more than other subjects do, regardless of the
judgments being made, indicating the presence of a personality trait seem-
ingly involving the “willingness to take a stand.”

One type of response set which has received some attention lately is 
that of “acquiescence.” Acquiescence is usually studied with either very
ambiguous or very difficult statements like the following: 

The moons of Saturn were first discovered by Gustav Whittenborn.    Yes    No

An excess of porthymadrone results in dilation of the pupils.    Yes    No


Very few persons are likely to have any first-hand information about 
statements like those above. Although both of them are false, they might 
as well be true as far as most persons are concerned. When presented 
with a longer list of such questions, some persons characteristically agree 
(answer “yes”) and others show a tendency to disagree. People who give
a preponderance of “yes” answers are said to acquiesce. That is, they seem
to be pushed into agreement by the force of the statements alone. 
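
The scoring of acquiescence can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the function name `acquiescence_score` and the sample answers are hypothetical, and the score is simply taken to be the proportion of “yes” answers given to a list of statements like those above.

```python
def acquiescence_score(answers):
    """Return the proportion of "yes" answers, a value from 0.0 to 1.0."""
    if not answers:
        raise ValueError("at least one answer is required")
    # Count answers that read "yes" regardless of case or stray spaces.
    yes_count = sum(1 for a in answers if a.strip().lower() == "yes")
    return yes_count / len(answers)


# Hypothetical answers of two subjects to ten ambiguous statements.
subject_a = ["yes"] * 8 + ["no"] * 2
subject_b = ["yes"] * 3 + ["no"] * 7

print(acquiescence_score(subject_a))  # 0.8
print(acquiescence_score(subject_b))  # 0.3
```

Under this assumed scoring, the first subject would be described as showing more acquiescence than the second.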

There are numerous other response sets which can be measured. When
a rating scale is used instead of a simple “yes” or “no” answer, people
tend to pile up their answers at different points on the continuum. If, for
example, opinions are being studied with a seven-point continuum which
ranges from “strongly agree” to “strongly disagree,” some people will most
often mark the extremes and some people will put most of their marks in
the middle of the continuum.

Response variability is another type of response set which is
being studied. This is measured by comparing scores made by the same
people on the same or similar tests given on two occasions. The instru-
ments can be either tests of the ability type or attitude, opinion, and per-
sonality inventories. Some persons will give much the same response on
the first and the second testing; other persons will vary their answers
considerably. There is some evidence (see 2, pp. 290-293) to indicate that
response variability is related to neuroticism and other personality
characteristics.
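
One simple way to express response variability as a number is sketched below in Python. The index used here, the proportion of items answered differently on the two occasions, is an assumption made for illustration and is not necessarily the index used in the studies cited; the function name and the sample answers are hypothetical.

```python
def response_variability(first, second):
    """Proportion of items answered differently on two testings."""
    if len(first) != len(second):
        raise ValueError("both testings must contain the same items")
    if not first:
        raise ValueError("at least one item is required")
    # Count the items on which the answer changed between occasions.
    changed = sum(1 for a, b in zip(first, second) if a != b)
    return changed / len(first)


# Hypothetical answers of one person to five items on two occasions.
first_testing = ["yes", "no", "yes", "yes", "no"]
second_testing = ["yes", "no", "no", "yes", "no"]

print(response_variability(first_testing, second_testing))  # 0.2
```

A value near zero would indicate a person who gives much the same responses on both testings; a larger value would indicate one who varies considerably.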

Knowledge Tests. One approach to the measurement of social traits is
to test people’s knowledge about particular kinds of social behavior.
For example, rather than try to measure friendliness with a self-
description inventory, subjects can be tested on their knowledge of cour-
tesy, the meaning of certain social behaviors, and the [...]
people. Although instruments of this kind have not generally been suc-
cessful, more systematic efforts might meet with success.

One example of an information test which proved useful in measuring
personality functions is the Naval Knowledge Test developed by Glickman
(6). The test was constructed for the purpose of predicting the
success of naval officer candidates in training school. A battery of con-
ventional tests of the aptitude type was being used successfully at the
time, and it seemed that only by the addition of tests of the personality
type could substantial gains in validity be made. Although the Naval
Knowledge Test is ostensibly a measure of factual information about
naval customs, equipment, and history, it apparently measures some
components of personality and interests. The sort of person who would 
interest himself in facts about the Navy and naval history before entering 
the service is likely to be someone who will like naval life and perform 
at least moderately well on actual duty. The test added substantially to 
the predictive efficiency which was obtained from the aptitude battery 
alone. 

Suggestibility Tests. It has long been thought that there is a connection 
between suggestibility and neurosis. One of the most widely used meas-
ures of suggestibility is the “body sway test.” In the test, the subject
stands blindfolded, and the experimenter says, “You are falling forward, 
you are falling forward . . .” Under these conditions some people will 
sway very little or not at all and others will sway to the extent of actually 
falling. The amount of sway is measured by mechanical equipment. Using 
this type of testing procedure, Eysenck (2, pp. 296-299) found a definite 
relationship between suggestibility and neuroticism. Not only was there a 
significant mean difference in suggestibility between neurotics and nor- 
mals, but amount of suggestibility differentiated among neurotics in terms 
of the severity of their disorders.


PHYSICAL AND PHYSIOLOGICAL MEASURES 


It is a truism to say that the way a person is physically constituted has 
a strong influence on his personality. Although much of what we call 
personality is the result of long and complex learning experiences, much 
also is apparently determined by the particular muscular, neural, and 
chemical make-up of the individual. In so far as relevant physical and 
physiological traits can be measured and validated, they can be used 
as tests. 

There is a wealth of evidence to demonstrate connections between 
physical states and personality. When people are sick, they seldom 
behave the same as when they were well. They are often more easily dis- 
turbed, quick-tempered, and depressed. Ordinary medicines which are 
taken for “colds” and other disorders affect our moods and social behavior. 
Different drugs tend to have different effects. One drug tends to make 
people elated and another tends to depress. The impact of neural func-
tions on personality is often witnessed in the altered behavior of people
with brain damage. Even the relatively mild physiological impact of the 
changing weather seems to alter our moods and social behavior. 

Physique. It has long been thought that different types of body build 
accompany different types of personalities. The prevalent social stereo- 
type is that short, fat people are friendly and jolly, whereas tall, thin 
people are thought to be shy and “serious.” There have been numerous
investigations during the last half-century (see 2, chap. 5) both to cata-
logue body types and to relate these to personality characteristics. The
typology which is used most often is that of the endomorph, or short, fat 
build, the ectomorph, or tall, thin build, and the mesomorph, or athletic 
build. Very few persons fall neatly into one of the types, and conse- 
quently, each person must be considered some compound of the three. 

Whereas there are evidently some relationships between body type and 
personality characteristics, the relationships are not as strong as was 
originally supposed. One of the major findings is that, among the 
mentally ill, schizophrenics tend to be ectomorphs and manic depressives 
tend to be endomorphs (2, chap. 5). There is also some support for the 
oft-held contention that ectomorphs are more often introverted and endo- 
morphs more often extraverted (2, chap. 5). Perhaps the next decade of 
research will show more clear-cut relationships between physique and 
personality characteristics. 

Autonomic Activity. The autonomic nervous system regulates the vary- 
ing states of bodily alertness, reaction to fear, accommodation to sleep, 
and many others. In spite of the many suggestions that people differ in 
autonomic characteristics and that these in turn are related to social 
behavior, there has been only a relatively small amount of research in 
this area. This is probably because autonomic and related physiological 
variables require rather elaborate and time-consuming measurement 
procedures.

Some of the autonomic and related physiological variables which have 
been studied so far are electrical skin resistance, blood pressure, metabolic 
rate, rate of salivation, muscular tonus, respiration rate, and pulse rate. 
The two kinds of studies which have been employed so far are 1) to 
determine individual differences in autonomic activity at one period in 
time and 2) to determine individual differences in autonomic response to
experimental variables, such as emotional situations, thirst, and electrical
shock.

The evidence is yet too meager to say with any conviction how person-
ality characteristics relate to autonomic activity (see 2, chap. 6). The
following suggestive results are offered mainly to indicate what is being
attempted.

In terms of the first type of study mentioned above, there is some evi-
dence to indicate that mobilized autonomic states (high blood pressure,
salivation, muscular tonus, etc.) are related to neuroticism and person-
ality inventory measures of shyness, introversion, and emotional instabil-
ity. In an interesting study of experimental changes in autonomic activity,
Freeman and Katzoff (5) measured the effect of sudden loud noises,
electrical shock, and others on autonomic mobilization and recovery.
They found rate of recovery to be significantly related to psychiatric
ratings of emotional stability.

Blood Chemistry. The blood stream carries not only the nutrients and 
regulators of bodily activity but substances which influence our moods 
and social behavior. In so far as chemical differences among people can 
be shown to relate to personality differences, a “blood test” might prove 
to be a perfectly sensible approach to predicting certain kinds of social 
behavior. Most of our knowledge about the relationship between social 
behavior and blood chemistry comes from introducing chemicals into the 
blood and noting their effect. Experimental studies of the effect of drugs 
and glandular injections on social behavior exemplify this approach. How- 
ever, it is more interesting, from the standpoint of psychological measure- 
ment, to search for existing chemical differences among people and to 
relate these to social behavior. 

The use of chemical indices to measure personality characteristics has 
received great encouragement from recent findings about schizophrenia. 
A number of independent investigators (see 11) have isolated a copperish 
substance which is present to a marked degree in the blood of most 
schizophrenic patients but present to only a slight extent in most normal 
individuals. Other studies show that when a substance from the blood of
schizophrenic patients is injected into normal persons, a transitory
state of schizophrenia develops which is indistinguishable from the
behavior of real patients. These findings may lead not only to new 
methods of treating schizophrenia but to a method for detecting schizo- 
phrenia as well. 

Summary of Physical and Physiological Measures. It is doubtful that
the fine points of personality will be measured by strictly “physical” varia- 
bles alone. The evidence so far is that physique, autonomic activity, and 
blood chemistry may eventually define some broad personality tendencies 
in people. As the present findings regarding schizophrenia indicate, physi- 
cal measures may prove useful in predicting whether or not people will 
develop severe mental disorders. However, it is doubtful that physical
measures alone will be the most efficient way to make more complex pre- 
dictions, such as whether two people will make good marriage partners 
or whether an individual will succeed as a physician. 


BIOGRAPHICAL AND PERSONAL DATA INFORMATION


As a final approach to the measurement of personality, we can return 
to the common-sense method that is used in everyday life. Without any 
more direct way of forecasting how an individual will behave in the 
future, the best bet is that he will continue behaving in the same manner 
that he has in the past. This is the principle upon which so much of our 
knowledge of other people is based. By observing our family and friends 
in many situations, we gradually learn to predict how they will behave 
in new situations. The person who has behaved shyly in numerous social 
situations in the past is not likely to switch to being the “life of the party.” 
The individual who has a long history of fights and arguments is likely to 
continue in that vein. The person who has shown jealous behavior in 
numerous previous situations is likely to act jealous in new situations. 
People do change, of course, but seldom so markedly or so rapidly as to 
render them grossly unpredictable to friends and family. 

Although we seldom keep systematic records of the validity with which 
we can predict the behavior of others, at least we feel that our forecasts 
are fairly accurate. There are, of course, many exceptions, which are evi- 
denced in the oft-heard phrase, “I didn’t think he would do that.” It would 
be an odd world, and a frightening one too, if we were not able to predict 
the behavior of others with some accuracy. We can depend partly on cul- 
tural mores and customs, which force most of us into relatively narrow 
channels of social response. For example, if you say “Good morning” to
an acquaintance, it can be predicted with fair accuracy that he will say
“Good morning” or some variant of it in return. If he replied with “Fish
for sale” or some other highly unpredictable response, you would prob-
ably attribute it to mental illness. Beyond the confines of culturally defined
actions, we also learn to predict with some accuracy the idiosyncrasies of
particular persons. We do this mainly through numerous observations of
behavior in different situations.

Rather than accumulate information about people on an informal basis, 
a systematic accumulation of facts can be made. Either the person can be 
“investigated,” or the individual himself can be asked to supply pertinent 
information. This is essentially what is done in many job application
blanks. The individual gives information about places of residence, health,
work history, military experience, hobbies, personal [...], and
others. Although the application blank is usually not employed as a test,
it can be standardized and scored. 

Although it might prove useful in the future to collect biographical
information to measure general personality traits, most of the uses so far
have been in selecting personnel for particular jobs. The “test” is con-
structed much like any other predictor instrument. A large collection of
“items” is made. Each item concerns a personal history event, action, or 
accomplishment. The items would, of course, likely be different for differ- 
ent personnel-selection situations. Some items which might be useful to 
predict the success of real estate salesmen are as follows: 


1. How many jobs have you had during the last five years?

2. How long were you employed in your last job?

3. How many years of schooling have you completed?

4. Do you own your own home?

5. How long have you lived in this city?

6. Are you married?

7. Do you belong to a fraternal organization or club?

8. How much money do you have in the bank?


Here we are only guessing what might be important characteristics of a 
successful real estate salesman. The real “proof of the pudding” is to learn 
which items differentiate successful salesmen from unsuccessful salesmen. 
A longer list of biographical items like those above could be administered 
to job applicants. One or several years later, measures of job success could 
be obtained, which might include ratings, amounts of property sold, and 
other pertinent indices of success. Then item-analysis procedures like
those discussed in Chapter 8 could be performed on the biographical 
items to obtain an efficient predictor instrument. This might show that 
successful salesmen are usually married, own their own homes, and have 
been residents of the particular city for a long time. The most valid items 
can then be used to form a test, which can henceforth be used to select 
personnel. The finished product is called a biographical inventory, per- 
sonal data form, or personal history record. 
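
The selection of valid biographical items can be illustrated with a small Python sketch. It is offered only as an illustration and is not the Chapter 8 procedure itself: it assumes yes-or-no items, a simple division of salesmen into successful and unsuccessful groups, and an arbitrary cutoff on the difference in proportions. The function names, the item names, and the data are all hypothetical.

```python
def item_differences(responses, successful):
    """For each item, proportion answering "yes" among the successful
    minus the proportion answering "yes" among the unsuccessful."""
    high = [r for r, s in zip(responses, successful) if s]
    low = [r for r, s in zip(responses, successful) if not s]
    if not high or not low:
        raise ValueError("both groups must contain at least one person")
    diffs = {}
    for item in responses[0]:
        p_high = sum(r[item] for r in high) / len(high)
        p_low = sum(r[item] for r in low) / len(low)
        diffs[item] = p_high - p_low
    return diffs


def select_items(diffs, threshold):
    """Keep items whose absolute difference reaches the cutoff."""
    return [item for item, d in diffs.items() if abs(d) >= threshold]


# Hypothetical answers of six salesmen (True means "yes").
responses = [
    {"married": True, "owns_home": True, "club_member": False},
    {"married": True, "owns_home": True, "club_member": True},
    {"married": True, "owns_home": False, "club_member": False},
    {"married": False, "owns_home": False, "club_member": True},
    {"married": False, "owns_home": True, "club_member": False},
    {"married": True, "owns_home": False, "club_member": False},
]
# Hypothetical criterion: the first three were rated successful.
successful = [True, True, True, False, False, False]

diffs = item_differences(responses, successful)
print(select_items(diffs, threshold=0.5))  # ['married']
```

With real data the groups would be far larger, and the differences would need to hold up in a new sample before the selected items were trusted.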

Biographical inventories have been employed successfully in a number 
of selection programs. One of the most striking examples is that of the 
Aptitude Index (see 4, pp. 282-286) developed by the Life Insurance 
Agency Management Association to select prospective insurance sales- 
men. Part I of the Index consists of 10 biographical items concerning net 
worth of the applicant, age, social affiliations, and others. Weighted scores 
on the 10 items are averaged and transformed into letter grades ranging 
from A through E. Validation studies (see 4, pp. 282-286) show that 
applicants who score A on the test sell between 300 and 400 per cent more 
insurance during the first working year than applicants who score E. 

Biographical inventories were among the most valid predictors of the 
successfulness of pilot trainees in the Army Air Force during the Second 
World War (see 7, chap. 27). The inventories concerned background 
information on parents, social affiliations, employment, hobbies, athletic 
accomplishments, and schooling. One form correlated .42 with the success
of pilots in training. 

You might wonder why a self-report test like the biographical inven- 
tory works well as a selection instrument whereas self-report instruments 
of similar appearance, such as self-description inventories, generally do 
not work well. Biographical inventories have a number of advantages over 
self-description inventories. It is difficult to construct unambiguous items 
for self-description inventories, but it is relatively easy to construct mean-
ingful biographical items. The subject is asked his age, how long he has
been married, and other understandable questions. It is apparently easier
to hypothesize and compose valid biographical items than valid self- 
description items. The research effort needed in the former is therefore 
considerably reduced. Most important, whereas self-description inven- 
tories are quite often invalidated by the amount of faking in selection sit- 
uations, biographical inventories are affected little by faking. The reason 
for this is that most of the biographical items are factual matters of the 
kind that can be checked on by others. An applicant who might be per- 
fectly willing to say that he loves his father more than his mother in order 
to appear in the best light would be quite reluctant to say that he is 
married when he is not or that he belongs to the Elks Club when he does 
not. There is some stretching of facts, such as the addition or subtraction 
of years from one’s age, on biographical inventories, but this is seldom 
so gross as the amount of distortion encountered in self-description 
inventories. 

The biographical inventory is probably the best measure of “person- 
ality” presently available for selection programs. Although it is not certain 
what kinds of personality attributes they measure, the inventories often 
add substantially to the predictive efficiency which can be obtained from 
tests of intellectual functions and special abilities. 


SUMMARY 


It is easy now to see that the early workers in personality measurement
anticipated achieving their goals more quickly and easily than has
actually been the case. This was probably due in large measure to the
comparative ease with which the early tests of ability were con-
structed and the many successful applications which have been encoun-
tered. In this chapter we saw that there are some hopeful approaches to
personality measurement, but these are mainly complex techniques and
require a relatively large amount of work both in their construction and
their application. The need, however, for adequate personality measures
is so acute that even expensive, laborious techniques would be worth
the cost.


CHAPTER 17


Selected Techniques for Psychological Experiments 


There are many psychological studies in which individual differences 
among people are of only secondary interest. In Chapter 1, a distinction 
was made between studies of individual differences and studies of psycho-
logical processes, a process being defined as a set of systematic changes
within the individual. Studies of psychological processes are usually
referred to as experiments, in distinction to the more general term 
which is used to refer to systematic observations of all kinds, including 
studies of individual differences. 

The purpose of this chapter is to consider some of the measurement 
techniques that are currently receiving wide use in psychological experi-
ments. In addition to their use in experiments, some of the techniques to
be discussed are useful in “descriptive” studies, where the effort is to 
determine the mean responses of a group of individuals. The opinion poll 
is an example of a descriptive study. 

Little attention will be given in this chapter to measurement methods
for studies of physiological psychology, perception, and learning. These 
are treated adequately in available texts. The emphasis will be on meas- 
urement in social psychology and the study of personality. 

In its simplest form, an experiment involves two variables, or measures.
One variable is manipulated by the experimenter, or the experimenter
waits for the variable to change by natural causes. The variable which
changes in this way is called the independent variable. The experimenter 
then determines the influence of the independent variable on a second 
measure, the dependent variable. 

The simplest kind of experimentation can be illustrated with the well-
known laws of gases. The problem is to determine the influence of pres-
sure on the temperature of a gaseous body. A gas, say ordinary air, is
enclosed in a cylinder. A gauge on the cylinder tells the pressure of the
gas. A thermometer notes the temperature inside the cylinder. Tem-
perature is the dependent variable. Pressure is the independent variable.
A plunger in the cylinder is slowly forced down, raising the pressure of
the gas inside. It is found in this way that temperature is a linear function
of pressure. An experiment in which change in the independent variable 
has no influence on the dependent variable can be found in Galileo's laws 
of falling bodies. If, as story has it, Galileo dropped a small and a large 
stone from the Leaning Tower of Pisa, he demonstrated that weight, the 
independent variable, has no influence on the velocity of falling bodies, 
the dependent variable. 

Nearly all of the test and measurement techniques which have been 
described in previous chapters can be and are used as variables in psycho-
logical experiments. Some examples are as follows: 

1. Psychophysical judgment. The purpose of the experiment is to deter- 
mine the influence of alcohol on sensory discrimination. Subjects are given 
an ounce of alcohol each fifteen minutes over a period of an hour and 
a half. Five minutes after each “cocktail,” the threshold for faint light 
signals is tested by the method of limits. Amount of alcohol consumed is 
the independent variable and height of threshold is the dependent 
variable. 

2. Achievement testing. The purpose of the experiment is to determine 
the differential amount of learning from two methods of teaching statis-
tics. Six instructors each teach two sections of statistics. One section is
given three lectures per week. In addition to the three lectures, the other 
section is given a laboratory session in which statistical problems are dis- 
cussed and worked through. In this experiment, the independent variable 
is expressed as a qualitative distinction between two methods of instruc- 
tion. The dependent variable consists of a standard achievement test 
which is given to all students. An analysis is undertaken to determine
which method of instruction produces the higher scores and to find the 
kinds of items which are more often correctly answered by the students 
in one or the other circumstance. 

3. Intelligence testing. The purpose of the experiment is to determine 
the influence of distraction on intelligence test scores. A group of 300 col- 
lege students is randomly divided into three subgroups of 100 each. One 
of the group intelligence tests is given to each subgroup in a separate 
room. Group A is tested in a quiet room as free as possible from distrac-
tions. Group B must work while a phonograph record plays lively music.
Group C is bombarded with noise. The same phonograph record used 
with Group B is played along with the noise of sawing wood coming from 
a buzz saw in the back of the room and the heavy pounding of a bass 
drum from the front of the room. Amount of distraction is the independ- 
ent variable and intelligence test score is the dependent variable. 

4. Attitude measurement. The purpose of the experiment is to deter-
mine the influence of living in a Northern city on attitudes of Southerners
toward Negroes. The dependent variable, a measure of attitude toward
Negroes, is given to 1,000 migrant Southerners who have lived in the
North from one month to thirty years. The independent variable is the 
number of months’ residence in the North. The independent variable is 
correlated with the dependent variable. 

5. Projective testing. The purpose of the experiment is to determine the 
personality changes that go on in two different kinds of psychotherapy. 
In one type of therapy, the patients are seen in individual sessions. In the 
other, patients meet as a group and talk over mutual problems. All 
patients are given Rorschach tests before and after the treatment. Expert 
examiners rate the Rorschach records for amount of anxiety, insight into 
problems, severity of illness, and other attributes. Comparisons are made 
to see which type of therapy produces the most change and the kinds of 
changes that go on in each. The independent variable is the qualitative
distinction between kinds of therapy. The dependent variable is the 
amount of improvement shown by the examiners’ ratings. 
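
The correlation of an independent variable with a dependent variable, as in example 4, can be sketched in Python with the product-moment formula. The figures below are invented purely for illustration and are not results from any study; the function name `pearson_r` and the variable names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, two or more values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("each variable must show some variation")
    return sxy / sqrt(sxx * syy)


# Invented data: months of residence (independent variable) and
# attitude-scale scores (dependent variable) for five subjects.
months_in_north = [1, 12, 60, 120, 360]
attitude_score = [2.0, 2.5, 3.1, 3.0, 4.2]

print(round(pearson_r(months_in_north, attitude_score), 2))
```

A positive coefficient in such an analysis would show only that the scores rise with length of residence; what that rise means depends on the validity of the attitude measure.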

The Validity of Experimental Variables. The validity of measures of 
individual differences was discussed in Chapter 4. The distinction was 
made between predictors and assessments, and rationales were presented 
for validating each type of measure. Although the rules given there are 
not completely airtight for all testing problems, they are sufficient guides 
for the major uses of psychological tests—in vocational guidance, school 
management, classroom testing, and personnel selection. 

How can the validity of experimental variables be determined? In other 
words, how can it be determined whether or not an experimental tech- 
nique measures what it is purported to measure? The rules which will 
be given here for the validation of experimental variables are less definite 
than those given in Chapter 4 for the validation of individual difference 
measures. As a first principle, there is usually more difficulty in validating 
dependent rather than independent variables. Independent variables are 
usually real-life things such as amounts of time, kinds of training, and 
other variations of circumstances. There is often considerable difficulty in 
determining whether dependent variables are what they are purported to 
be. The examples, 1 through 5 above, of psychological experiments were
ordered in terms of the difficulty with which their dependent variables
could be validated.

Any experimental variable which can be considered an assessment in 
terms of the rules given in Chapter 4 can be judged by the standards for 
assessments. Such measures are either reasonably self-evident measures of 
what they purport to measure, such as sensory threshold in the first exam- 
ple, or measures founded on the principle of content sampling, such as 
the achievement test in example 2. It is not with measures of the assess-
ment type that particular difficulty is met in the validation of experi-
mental variables. Assessments used as experimental variables can be
validated with equal ease, or difficulty, as those used in studies of indi-
vidual differences.

The difficulty in validating experimental variables comes with the use 
of instruments that are tests of the predictor type. Examples 3, 4, and 5 
above all use dependent variables which the author would classify as 
predictors. The function of the dependent variable is not to predict some 
future behavior outside of the experiment but to measure the change that 
goes on within the experiment. Consequently, experimental variables 
should all be assessments. Unfortunately, either nothing is known or 
nothing is feasible to use in many experiments except measures of the 
predictor type. 

Whenever experimental variables consist of predictor measures, the 
experimental results are open to interpretation and, more or less, to 
uncertainty as to what they mean. In example 4 above, the experimental 
results would show only that people express different attitudes after they 
have been in the North for a considerable period of time. Verbal behavior 
is important, and in some studies this is all that is meant to be measured, 
but in many experiments the object is to show that attitude in some more 
basic sense has changed. It would be hoped that the changes in response 
to the attitude-measuring device reflect real changes in the social response 
to Negroes rather than only a change in verbalized attitudes.

The meaning of the results in the fifth example above is directly
dependent on the Rorschach test and its use by experienced examiners as
a measure of personality attributes. Because of the generally unanswered 
questions about validity, most psychologists would view the results of 
the experiment with some skepticism. 

The validity of predictor measures in psychological experiments is 
dependent on what the study purports to show. The purpose in some 
experiments is to show that predictor measures are influenced by the cir- 
cumstances in which they are administered. Example 3, above, is meant 
to show that distraction in the testing situation influences scores on intel- 
ligence tests. No one would interpret the results of the experiment as 
showing that distraction influences intelligence but only the testing of 
intelligence. 

Some of the experimental variables which will be discussed in the fol- 
lowing sections can be used either to record what the subject says that 
he likes or that he thinks, or they can be used inferentially to measure 
what he likes and thinks. In the former usage, when the intention is only 
to determine what the subject is willing to say about himself and others, 
the results can usually be taken at face value. In the latter case, when an 
experimental variable purports to measure underlying tendencies in the
individual, not just verbal report, the validity of the variables is open
to question.

When instruments of the predictor type are used as experimental variables
and the intention is to measure something more than verbal report,
the principal validation procedure is that of construct validity as it was
discussed in Chapter 4. That is, a measure of the predictor type gains
meaning of its own through continued use and research. That is why the
intelligence test in example 3 above can be considered to have more 
construct validity than the attitude-measuring instrument in example 4. 
And similarly, the attitude-measuring instrument in example 4 has more 
construct validity than the Rorschach test in example 5. The building-up 
of construct validity for an instrument is a slow process, and no instru- 
ment in psychology has been studied sufficiently to say unequivocally that 
it has a specified degree of construct validity. However, for many of the 
experimental variables that are in use, validity can be known only by 
the slow building-up of connections with other human attributes. 

It would be unduly restrictive to expect that researchers who are work- 
ing on the “cutting edge” of psychological science should use only experi- 
mental variables of known validity. In the beginning stages, a new meas- 
ure is more a hunch than a proved instrument. But it is also incumbent 
upon psychological investigators to keep their measures open to continu- 
ous research and not simply to retain measures because of faith, fiat, or 
fashion. 


THE Q-SORT 


The Q-sort rating technique is most prominently identified with the 
work of William Stephenson (see 23) and his colleagues. The particular 
rating scheme is only one aspect of a more general methodology for the 
study of personality. Stephenson’s methodology will not be discussed 
here, only the handy Q-sort rating procedure, which lends itself to a 
wide variety of studies.

The Q-sort was developed as a means of recording and [...]
multiple judgments, preferences, and impressions. In a typical study,
subjects are asked to rate 100 photographs of art objects in terms of their
preferences. Instead of rating each of the photographs separately, the
subject is asked to “sort” them in terms of a forced-normal distribution. A
typical forced-normal distribution for 100 items is shown in Table 17-1.


TABLE 17-1. TYPICAL FORCED-NORMAL Q-SORT RATING DISTRIBUTION
FOR 100 ITEMS

Number of items    2   4   8  12  14  20  14  12   8   4   2
Pile number        0   1   2   3   4   5   6   7   8   9  10
               Prefer least                        Prefer most


In the forced-normal distribution shown in Table 17-1, the subject is
first required to pick the two photographs which he likes best and place 
them in pile 10. Then he chooses the 4 photographs which he likes next 
best of the remaining 98 photographs and places them in pile 9, and so 
on to the 8 photographs in pile 8, and the 12 photographs in pile 7. The 
pile numbers are written separately on file cards and spread out on a 
large table or desk. The subject literally sorts out the photographs to 
demonstrate his preferences. 
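
Seen as a computation, the sorting task cuts a complete preference ordering into piles of fixed size. The following Python sketch illustrates this under stated assumptions: the pile counts are those implied for Table 17-1 (2, 4, 8, 12, 14, 20, 14, 12, 8, 4, 2 for piles 0 through 10), the helper name `assign_piles` and the photograph labels are hypothetical, and the subject's ordering is taken as already known.

```python
# Pile sizes for 100 items, from pile 0 (prefer least) to pile 10 (prefer most).
PILE_COUNTS = [2, 4, 8, 12, 14, 20, 14, 12, 8, 4, 2]


def assign_piles(items_least_to_most, pile_counts=PILE_COUNTS):
    """Return {item: pile number} for items ordered from least to most preferred."""
    if len(items_least_to_most) != sum(pile_counts):
        raise ValueError("number of items must equal the total of the pile counts")
    piles = {}
    position = 0
    for pile_number, count in enumerate(pile_counts):
        # The next `count` items in the ordering all go into this pile.
        for item in items_least_to_most[position:position + count]:
            piles[item] = pile_number
        position += count
    return piles


# Hypothetical labels: photo_000 is liked least, photo_099 is liked best.
photos = ["photo_%03d" % i for i in range(100)]
sort = assign_piles(photos)
```

The resulting pile numbers are the scores that enter any later statistical treatment of the sort.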

The most widely used approach is to have the subject work from both 
extremes of the distribution toward the middle. That is, he first sorts the 
items into the several “most prefer” categories, say in this case working 
down as far as pile 7. Then he switches to the “least prefer” end, working 
down from pile 0 to about 3 in the present example. Finally the items 
are placed in the remaining middle piles. One reason why this approach 
is used is that extreme likes and dislikes are usually easier for the subject 
to rate. A second reason for working from the extremes to the middle of 
the continuum is that the extremes are much more important than the 
middle piles in determining the size of correlations and other statistical 
measures. Therefore, it is more important to have reliable ratings on the 
extremes than on the middle piles. 

A forced distribution can be constructed quite simply for any-sized 
item sample being used. This can be done by referring to the areas under 
different sections of the normal distribution (see Appendix 3). A simpler 
procedure is to make up a distribution which looks like the normal curve 
without going to a more rigorous effort to fit the normal distribution form. 
It matters very little whether the distribution is precisely normal or only 
approximates the normal distribution.
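
One way to carry out the more rigorous procedure is sketched below in Python. It is an illustration only, resting on assumptions that are not in the text: the piles are taken to be equal-width intervals between -2.5 and +2.5 standard deviations, the two end piles absorb the tails, and the rounding rule (largest remainders first) is one choice among several. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def forced_normal_counts(n_items, n_piles, z_span=2.5):
    """Approximate pile sizes for a forced-normal distribution.

    Assumption: pile boundaries are equally spaced from -z_span to +z_span,
    and the outermost piles take whatever lies in the tails.
    """
    width = 2.0 * z_span / n_piles
    inner_edges = [-z_span + width * i for i in range(1, n_piles)]
    cdf = [0.0] + [normal_cdf(e) for e in inner_edges] + [1.0]
    exact = [(cdf[i + 1] - cdf[i]) * n_items for i in range(n_piles)]

    # Round down, then hand the leftover items to the piles with the
    # largest fractional parts so that the counts total n_items.
    counts = [int(math.floor(x)) for x in exact]
    leftover = n_items - sum(counts)
    by_fraction = sorted(range(n_piles), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


counts = forced_normal_counts(100, 11)
print(counts)
```

Because the exact form matters so little, a distribution obtained this way can be adjusted by hand, for example to make it symmetric, without harm.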

The number of categories, or piles, varies in different studies in accord- 
ance with the size of the item sample. A rule of thumb is to use 9 piles 
with 60 to 80 items, 11 piles with 80 to 100 items, and 13 piles with 
samples of more than 100 items. Most persons prefer an odd rather than 
an even number of piles. An odd number of piles provides a middle pile 
that can be used for “neutral” items or items that are unrelated to what 
is being rated.

Item Universes and Item Samples. One of the appealing features of the 
Q-sort is its ability to handle a wide variety of item material. All things 
that can be photographed, drawn, or written on cards can be used as 
items. This provides a means of studying many kinds of materials that 
would otherwise be experimentally unmanageable.

One of the praiseworthy features of the Q-sort is that it emphasizes the
sampling of item content. The collection of items actually used in the
Q-sort, 100 photographs in the example above, is referred to as an item
sample, or it is variously called a “trait sample,” or a “stimulus sample.”
When properly done, the item sample should be as representative as possible
of a specified “item universe.” Content sampling was discussed previously
in respect to achievement tests and assessments generally. A rating
method like the Q-sort also has more meaning when the items range 
broadly across a larger body, or universe, of possible items. 

Analysis of Q-sort Data. The completed Q-sort represents the subject's 
preferences for or judgments about the things in an item sample. The 
simple description of people’s responses may be all that the research seeks
to accomplish, but in most cases the object is to compare the responses 
of subjects with one another or to compare the responses of subjects 
before and after an experimental treatment. The correlation coefficient 
is the statistic which is used most often in the analysis of Q-sort data. 
One sort can be correlated with any other that is made with the same 
item sample. This provides a correlation between two persons rather 
than the more conventional correlation between two tests. The correla- 
tion between two sorts is an index of the similarity of the item placements 
either by two persons or by the same person on two occasions. Because of 
the forced distribution that is used for all sorts made with the same item 
sample, a short-cut correlational formula can be applied. The formula is 
presented in Appendix 7. It allows the computation of correlation coeffi- 
cients in only a fraction of the time that is required by conventional 
formulas like those presented in Chapter 5. 
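
The reason a short cut is possible can be shown with a small Python sketch. When two sorts use the same forced distribution they share one mean and one variance, and the product-moment correlation then reduces to r = 1 - (sum of squared pile differences) / (2 N variance). This identity is standard, but whether it is the exact form given in Appendix 7 is an assumption here, and the nine-item sorts below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Conventional product-moment correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shortcut_r(x, y):
    """Correlation of two sorts that share the same forced distribution."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((a - mean) ** 2 for a in x) / n  # identical for every sort
    sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1.0 - sum_d2 / (2.0 * n * variance)


# Pile numbers given to the same nine items by two persons
# (forced distribution: 1, 2, 3, 2, 1 items in piles 0 through 4).
sort_a = [0, 1, 1, 2, 2, 2, 3, 3, 4]
sort_b = [1, 0, 2, 1, 2, 3, 2, 4, 3]

r_full = pearson_r(sort_a, sort_b)
r_short = shortcut_r(sort_a, sort_b)
```

Since the variance is fixed by the forced distribution, only the squared pile differences need to be summed for each new pair of sorts.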

Correlations among sorts are often carried on to a factor analysis. The 
correlation matrix contains all of the intercorrelations among the sorts of 
a number of persons; or in a few studies (see 14, 15), the correlation 
matrix is filled with the intercorrelations of a number of sorts made by 
only one person. Conventional factor-analytic techniques (see 10) can be 
applied to the intercorrelations among Q-sorts as well as to the intercorre- 
lations among tests. Each factor represents a common way of ordering the 
items by a number of persons. 

In the illustrative study mentioned earlier with the 100 pictures of art 
objects, the preferences of some subjects might be determined by realism, 
by the extent to which the art objects accurately represent real-life forms. 
The subjects whose choices are made in this way will tend to correlate 
with one another. Other subjects may sort the 100 pictures in terms of
their impressionistic value, regardless of how well the items picture real- 
life forms. The sorts of the impressionists will tend to correlate more with 
one another than they correlate with those of the realists. These are the 
conditions under which clusters arise in the correlation matrix, and the 
impressionists and realists are likely to be separated as factors in a
formal factor-analytic treatment. Most persons are usually found to have
a mixture of factors rather than to have one factor only. That is, a person's
sort can be determined both by realism and impressionism. The loadings
which the subject has on the different factors show the extent to which the
factors determine his preferences or judgments.
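
The clustering that precedes a factor analysis can be seen in a small correlation matrix among persons. The Python sketch below uses four invented nine-item sorts (two hypothetical "realists" and two hypothetical "impressionists"); the data are made up purely for illustration, and the sketch stops at the correlation matrix rather than carrying out a factor analysis.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two sorts of the same items."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented pile numbers (0 to 4) for nine items; every sort uses the
# same forced distribution of 1, 2, 3, 2, 1 items.
sorts = {
    "realist_1":       [0, 1, 1, 2, 2, 2, 3, 3, 4],
    "realist_2":       [1, 0, 1, 2, 2, 2, 3, 4, 3],
    "impressionist_1": [2, 3, 1, 4, 2, 0, 1, 2, 3],
    "impressionist_2": [2, 3, 1, 3, 2, 0, 1, 2, 4],
}

names = list(sorts)
r = {(a, b): pearson_r(sorts[a], sorts[b]) for a in names for b in names}

# Print the person-by-person correlation matrix.
for a in names:
    print(a.ljust(16), " ".join("%6.2f" % r[(a, b)] for b in names))
```

In these invented data the sorts within each pair correlate highly and the sorts from different pairs correlate weakly, which is the pattern a factor analysis would summarize as two factors.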

Evaluation of the Q-sort. One point of criticism which has been leveled
against the Q-sort concerns the forced-normal distribution. The rating
technique assumes without any formal proof that a normal distribution
fits the ratings of different subjects. The forced distribution requires all
subjects to have the same mean, the same standard deviation, and the
same distribution form. It is possible and quite likely that if subjects are
left to use whatever distribution they like, they will differ from one
another in the way in which they distribute the items.

It can be counterargued that the forced distribution is only a conven- 
ience which does little injustice to the data one way or the other. If, as is 
usually the case, correlations among persons and subsequent factor analy- 
sis are the major statistical treatments, the forced distribution distorts the 
data almost not at all. It was shown in Chapter 5 that the correlation 
coefficient automatically equates means and standard deviations of two 
sets of numbers. The correlation coefficient is insensitive to even marked 
changes in the distribution form. If two normally distributed variables 
correlate a particular amount, say .50, the correlation will usually not 
change by more than .01 or so if both distributions are flattened to a 
complete rank-order. Likewise, skewing the distributions or even making 
them bimodal usually brings about only small changes in particular cor- 
relations. The forced distribution then saves a great deal of time in per- 
forming correlations (see Appendix 7) when only adding machines and 
desk calculator equipment are available. 
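
The size of the change produced by flattening to a rank-order can be checked for the idealized case of a bivariate normal population, where the rank-order (Spearman) correlation is known to equal (6/pi) arcsin(r/2). The Python sketch below applies that relation; it is an illustration for the population case only, and sample values will scatter around these figures. The function name is hypothetical.

```python
import math


def rank_order_r(r):
    """Population rank-order correlation for bivariate normal variables
    whose product-moment correlation is r."""
    return (6.0 / math.pi) * math.asin(r / 2.0)


for r in (0.30, 0.50, 0.70):
    flattened = rank_order_r(r)
    print("r = %.2f   rank-order r = %.3f   change = %.3f" % (r, flattened, r - flattened))
```

For a correlation of .50 the relation gives a rank-order value a little above .48, a change of somewhat under .02, which is of the same small order as the figure quoted above.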

The Q-sort is a relative rating technique. It requires the subject to 
decide which things he likes better or to judge which things possess more 
of a stated attribute. Some persons have criticized the Q-sort on this basis, 
claiming that absolute ratings should be used instead. An example of an 
absolute rating is the conventional rating scale, in which each item is 
rated separately: either into one of two categories or along a continuum. 
Absolute ratings could be used in the illustrative study of art objects 
mentioned earlier. Each photograph could be rated separately on a 
seven-point scale ranging from “like very much” to “dislike very much.” 

The people who criticize the relative ratings in the Q-sort evidently fail 
to consider that almost all of the traditional psychophysical methods are 
relative ratings also. Looking back to the discussion of psychophysical 
methods in Chapter 2, it can be seen that the methods of constant stimuli, 
rank-order, and pair comparisons are all relative ratings. Either rank-order 
or pair comparisons could be used to study judgments or preferences with 
respect to item samples, but both of these are far more laborious than the 
forced distribution with little gain in precision. 

Rather than argue that absolute rating methods are better in general
than relative rating methods or vice versa, it is more fitting to consider
them as alternative techniques for handling different kinds of problems.
Relative ratings make more sense with certain kinds of studies and absolute
methods are more sensible for others. Relative ratings, like the Q-sort,
should be used only when the items are all from some common frame of
reference. To take an extreme example, it would be incorrect to have the
subject sort 100 pictures, half of which are photographs of automobiles
and half of which are photographs of houses. In making the Q-sort the
subject would have to decide not only which houses he likes better and
which automobiles he likes better, but whether he likes automobiles
better than houses. This would make for a very ambiguous rating task.

Relative rating techniques, like the Q-sort, must be used when there is 
no concrete basis on which absolute ratings can be made. For example, 
no one would seriously consider using absolute ratings in a study of lifted 
weights. It would make little sense to have the weights judged on, say, a 
seven-point scale ranging from “very heavy” to “very light.” Similarly in 
many studies of preferences and impressions, there is little concrete basis 
on which absolute ratings can be made. For example, the individual can 
feel quite safe in his choice between steak and lamb chops but have diffi- 
culty in deciding how much he likes either in an absolute sense. Such 
preferences are inherently comparative. On the other hand, there are 
many studies in which absolute ratings are essential. This is so in most 
studies of attitudes. For example, it might be of some interest to learn the 
individual's relative feelings about Negroes as opposed to Jews and Ori- 
entals; but the major purpose of such studies is to learn the degree of 
favorableness of the individual's attitudes toward different groups. 

As a final point of criticism about the Q-sort, there has been some con- 
troversy about the objectivity of such ratings. People who have used the 
technique in clinical study, particularly projective testers, have spoken 
of the Q-sort as being objective, whereas critics of the method say that it
is subjective. A distinction needs to be made between objectivity and
validity. The Q-sort provides a means of objectifying impressions, but
they are impressions nevertheless and not necessarily correct or valid until
proved so. The important point is that rating techniques like the Q-sort
provide the first step in studying the validity of impressions.

Illustrative Studies. The Q-sort rating technique is not confined to
any one field of psychology. It is a method for investigating a wide variety
of problems. The following illustrative studies are presented to show the
range of topics that have been studied.

Personality. Nunnally (14) studied the changing self-conception of
one subject over a period of two years’ time. The item sample was obtained
from a three-months prestudy of the individual. Items were distilled
from protracted discussions with the subject and her
friends. Additional items were suggested by projective and aptitude test
results. The 60 items used in the Q-sort were meant to cover a broad
range of the things that the subject says about herself or might say under
appropriate conditions. Some of the statements are as follows:


a. I sometimes think of myself as being neglected and unloved. 

b. I feel pleasantly exhilarated when all eyes are on me.

c. I sometimes feel that I could run out of the room I am in, scream, 
or burst into tears. 

d. When at a party I can get quite drunk to escape the monotony.


The subject used the item sample to describe herself, sorting the items 
from “most characteristic of myself” to “least characteristic.” She made 
15 separate Q-sorts to describe her behavior in different social settings. A 
factor analysis showed three factors, representing three different modes of 
behavior. Predictions were then made about how the individual would 
sort and how the three factors would change after psychotherapy. The 
main purpose of the experiment was to demonstrate ways of studying 
self-conception. 

Communications. Swanson * carried out a series of Q-sort studies concerning
the readership of magazines. In one study he was interested in
the effect of title pages on the readership of different articles. One hun- 
dred title pages were sampled from the issues of a particular magazine 
over several years. These were placed individually on cardboard squares 
and used as the items for Q-sort ratings of preference. The responses of 
20 subjects were factor analyzed. The four factors which were obtained 
reflected interestingly on why different people read different kinds of 
articles. Illustrating the type of results which he obtained, one of the 
factors is a clear measure of sensationalism. The title pages relating to 
the factor contain pictures of corpses, daggers, guns, voluptuous women, 
and the like. The titles themselves contained terms like B-girl, bloodhound,
murder, FBI, and sex.

Psychotherapy. The field of psychotherapy abounds with different 
theories about the therapeutic process and about methods of treatment. 
Fiedler (6) performed a Q-sort study of the viewpoints of therapists 
about the therapeutic process. The item sample contained statements 
concerning a wide variety of ways in which the therapist might interact 
with the client. The therapists included 20 practitioners each of psycho- 
analytic, nondirective, and Adlerian schools. Ten members of each school 
were expert psychotherapists and ten were relatively inexperienced, non- 
expert therapists. Each therapist was asked to sort the item sample in 
such a way as to describe his behavior in therapeutic interview situations. 
Correlational analyses showed that experts in the three schools of thought
gave Q-sort descriptions which were more alike than were the descriptions
of experts and nonexperts within any one school. The results suggest
that therapists become more alike in their practice with continued experience
and that differences between schools of psychotherapy are not as
marked as they would seem.

* Personal communication from C. E. Swanson.

Esthetics. Stephenson (23, pp. 128-144) used the Q-sort technique to 
study the esthetic values of art students. The item sample consisted of 
120 cards containing various arrangements of colored rectangles. The rec- 
tangles were arranged in such a way as to be either overlapping or non- 
overlapping and to form either regular or irregular designs. The hypothe- 
sis was that experienced artists would prefer irregular and overlapping 
designs and that artistically untrained persons would not prefer that 
combination. Stephenson’s interest in the problem was largely methodo- 
logical, to show how the Q-sort could be used in studies of esthetics. A 
small-scale investigation with 18 subjects tended to uphold the hypothesis. 


THE SEMANTIC DIFFERENTIAL 


The Semantic Differential rating instrument is most prominently identi- 
fied with the work of C. E. Osgood (see 19) and his associates. It grew 
out of the need for a measure of “meaning,” for a way of indexing the 
manifold implications of objects, persons, and ideas for the human 
observer. Perhaps no other construct is as inextricably interwoven with 
psychological and philosophical theory as is meaning. It is axiomatic that 
human behavior is determined by the meaning of events rather than by 
intrinsic properties of events. The baby reacts approvingly to its mother's
voice because that has acquired the meaning of nourishment, [...],
and protection. A soft buzzing sound will strike fear in the experienced
outdoorsman for whom the meaning is rattlesnake and danger. On a more
complex social level, different people harbor different meanings for concepts
like Republican Party, Jew, Protestant Church, Cadillac, college,
and even chop suey.

The fundamental hypothesis underlying the Semantic Differential is
that certain important components of meaning can be measured by the
rating of objects or ideas in respect to bipolar adjectives. Bipolar terms
like good versus bad, strong versus weak, and intelligent versus ignorant
are continua along which everyday meanings are expressed. On a metaphorical
level, we may, for example, speak of a person as being cold rather
than warm. The implication is that the cold individual is an unfriendly
person rather than that his bodily temperature is below normal. The
American culture as well as a number of other cultures studied to date
(see 19, pp. 170-176) have a common understanding of a wide
variety of bipolar terms. It seems to be almost universal that good things
imply white instead of black, high instead of low, and full instead of
empty.

Each set of bipolar adjectives is called a scale on the Semantic Differ- 
ential. The number of scales in most studies ranges from a half-dozen to 
more than thirty. The customary approach is to use a seven-point continuum
for each set of bipolar adjectives. The thing to be rated, the
concept, is placed at the head of a page above the bipolar scales. An 
abbreviated form is shown in Figure 17-1. A list of scales is presented in 
Appendix 8 which contains the bipolar adjectives which have been most 
widely used in studies to date. The same set of scales is used to rate all of 
the concepts in a particular study. The pages of the rating form differ only 
in terms of the concept at the top of each sheet. 


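FIGURE 17-1. A Semantic Differential form for rating the concept Napoleon.


                         Napoleon

Good      ___ ___ ___ ___ ___ ___ ___  Bad
Weak      ___ ___ ___ ___ ___ ___ ___  Strong
Happy     ___ ___ ___ ___ ___ ___ ___  Sad
Ignorant  ___ ___ ___ ___ ___ ___ ___  Intelligent
Fast      ___ ___ ___ ___ ___ ___ ___  Slow
Heavy     ___ ___ ___ ___ ___ ___ ___  Light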


Multidimensional Meaning. Basic to the use of the Semantic Differen- 
tial is the hypothesis that meaning is multidimensional, that there are a 
number of facets of meaning rather than only one continuum along which 
things are ordered. At the rawest level, each scale can be taken as a 
separate facet of meaning. Two concepts can be said to have similar 
meanings if they are given nearly identical scores on a wide variety of 
scales. If two concepts have similar scores on some of the scales and not
on others, this provides a way of charting the ways in which their mean- 
ings are similar and different. But if each scale must be considered a 
separate facet of meaning, this provides little understanding of the 
nature of meaning and also makes the results of any one study dependent 
on the particular scales which are used. Fortunately, bipolar scales like 
those used in the Semantic Differential evolve into a relatively small number
of common factors. Factor-analytic studies by Osgood and Suci (18)
and by others of a wide variety of scales applied to a wide variety of concepts
have agreed on the following factors:


1. The Evaluative Factor, most prominently identified by scales like
good-bad, valuable-worthless, kind-cruel, bitter-sweet, clean-dirty,
fair-unfair, and pleasant-unpleasant.

2. The Potency Factor, most prominently identified by strong-weak,
large-small, heavy-light, and thick-thin.

3. The Activity Factor, most prominently identified by active-passive, 
fast-slow, hot-cold, and sharp-dull. 


The three factors above are ordered in terms of the variance that they 
account for in the scales which have been studied to date. The evaluative 
factor is by far the most prominent of the three. It appears to measure 
“attitude” much like that which is tested by conventional attitude scales. 
Most scales have at least moderate loadings on this factor and many have 
loadings well above .80. Some of the scales are mixtures of the factors 
rather than measures of one factor only. For example, hard-soft is a mix- 
ture of evaluation and potency, and relaxed-tense is a mixture of evalua- 
tion and activity. 

The three factors described above are useful in classifying scales. Thus, 
rather than include a large number of scales which are known to be 
highly evaluative in a rating form, only a smaller number of the most 
highly loaded scales need be used. Similarly, a few scales can be included 
to represent potency and activity. This will provide ample room for 
including additional scales that have particular relevance to the concepts 
being studied. 

By averaging scores on the most highly loaded scales for each factor, 
an approximate factor score can be obtained for each individual. Exact 
factor scores require more complex statistical treatments (see 27, 
chap. 15). For example, if an individual rates the concept labor union on
four of the purely evaluative scales at positions 4, 5, 6, and 5 respectively,
the average of these, 5, can be taken as his over-all evaluation of labor
union. In a similar way, factor scores on the potency and activity factors
can be found for labor union or any other concept in the particular study.
The resulting scores can be pictured geometrically in what is called a
semantic space. Each factor is used as an axis, resulting in a three-dimensional
scattering of points. The position of each point in the space
and the proximity of points to one another can be used to understand the
meaning of particular concepts either for one individual separately or for
the average responses of a group. A two-dimensional example is shown
in Figure 17-2.


FIGURE 17-2. The scores for one person on the evaluative and potency factors for the
concepts God, My Father, Satan, and Beggar.

[Plot: vertical axis Evaluation (low to high); horizontal axis Potency (low to
high). God and My Father lie in the high-evaluation half; Beggar and Satan lie
in the low-evaluation half.]
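
The averaging procedure and the notion of proximity in the semantic space can be sketched in Python as follows. Several things here are assumptions made for illustration: the choice of scales for each factor (drawn from the lists given earlier), the convention that position 7 is the first-named pole, the use of straight-line (Euclidean) distance as the measure of proximity, and every rating except the four evaluative ratings of labor union taken from the example above. The second concept, management, is hypothetical.

```python
import math

# Assumed assignment of scales to factors, taken from the factor lists.
FACTOR_SCALES = {
    "evaluation": ["good-bad", "valuable-worthless", "kind-cruel", "clean-dirty"],
    "potency": ["strong-weak", "large-small", "heavy-light"],
    "activity": ["active-passive", "fast-slow", "sharp-dull"],
}


def factor_scores(ratings):
    """Approximate factor scores: the mean rating over each factor's scales."""
    return {
        factor: sum(ratings[scale] for scale in scales) / len(scales)
        for factor, scales in FACTOR_SCALES.items()
    }


def semantic_distance(scores_a, scores_b):
    """Straight-line distance between two concepts in the three-factor space."""
    return math.sqrt(sum((scores_a[f] - scores_b[f]) ** 2 for f in FACTOR_SCALES))


# One person's ratings on seven-point scales.
labor_union = {
    "good-bad": 4, "valuable-worthless": 5, "kind-cruel": 6, "clean-dirty": 5,  # from the text
    "strong-weak": 6, "large-small": 6, "heavy-light": 6,                       # hypothetical
    "active-passive": 5, "fast-slow": 3, "sharp-dull": 4,                       # hypothetical
}
management = {  # entirely hypothetical
    "good-bad": 2, "valuable-worthless": 3, "kind-cruel": 3, "clean-dirty": 4,
    "strong-weak": 6, "large-small": 6, "heavy-light": 6,
    "active-passive": 4, "fast-slow": 4, "sharp-dull": 4,
}

union_scores = factor_scores(labor_union)
management_scores = factor_scores(management)
distance = semantic_distance(union_scores, management_scores)
```

With these ratings the evaluation score for labor union is 5, as in the example, and the two concepts differ only along the evaluative axis.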

Illustrative Studies. A great deal of empirical work has been under- 
taken in recent years with the Semantic Differential (see 19), ranging 
from studies of dream symbolism to cross-cultural studies of language. 
The following studies were selected to illustrate the wide variety of 
applications for the measuring instrument. 

Psychotherapy. Because of the traditional interpretation of psycho- 
therapy as a way of changing attitudes toward or meanings about one’s 
self and the surrounding human environment, the Semantic Differential 
has a welcome place in understanding the psychotherapeutic process. One 
of the most interesting studies of this kind concerned the now famous case 
of triple personality first reported by Thigpen and Cleckley (26). They 
made a protracted study of a psychotherapy patient who periodically 
displayed such marked changes in attitudes, gestures, voice quality, and 
general behavior as to suggest an almost complete movement from one 
personality to another. The subject’s customary “personality” was given 
the pseudonym of Jane, and the less frequent alternative modes of 
behavior were referred to as Eve White and Eve Black respectively. 
Thigpen and Cleckley followed the case until psychotherapy “killed off” 
the two Eves, leaving only the behavior typical of Jane. 

The Semantic Differential entered the study as an independent measure 
of the different “personalities” and as a means of charting the changes 
brought by psychotherapy. A Semantic Differential form concerning con- 
cepts in the patient's life was administered to each of the three “person- 
alities” as they occurred in different psychotherapy sessions. The same 
three measures were taken later in therapy, and follow-up measures were 
taken over the subsequent months. With little knowledge of the case 
other than that there were three “personalities,” Osgood and Luria (17) 
attempted a blind analysis from the Semantic Differential data. They were 
not only able to distinguish the three “personalities” and to provide con- 
siderable insight into the behaviors involved in each, but they also made 
some correct predictions about the final personality which would be evi- 
denced in the ratings. 

Developmental Psychology. Following the hypothesis that meanings are
learned and not inborn in the child, there should be an increasing tendency
for children to agree with one another about meanings as they grow
up. For example, four-year-old children are likely to hold different meanings
about the word lion or a picture of the animal. One child will think
of the lion as a pretty pussycat and another child will consider the animal
as an immediate danger, lurking just beyond the cellar door. Older children
and adults are more likely to agree with one another that the lion
is powerful, dangerous, and remote from the immediate environment.


Donahoe (see 19, p. 289) used a Semantic Differential form to study 
the movement toward agreement in meanings of common objects as a 
function of age. Using a standard set of concepts and scales, subjects 
were tested at four age levels ranging from the first grade to the college 
level. He found, as predicted, an increasing tendency toward agreement 
among subjects going from younger to older ages. 

Learning Theory. A number of studies have been undertaken to relate 
responses on the Semantic Differential to typical measures of learning and 
habit formation. Osgood and his colleagues (see 19, pp. 155-159) studied 
the relationship between latency, how long the subject takes to respond, 
and the extremeness of ratings on Semantic Differential scales. They found 
that when a subject marks a concept on the extremes of a scale, on the 1 
and 2 or 6 and 7 positions, he makes up his mind more quickly than if the 
mark is made toward the center of the scale. 

Staats and Staats (22) have recently used the Semantic Differential in
[...] pleasant than subjects who had received negative evaluation terms in
conjunction with the country. This suggests, quite [...], that
meanings as measured by the Semantic Differential can be conditioned.

[...] studies of public reaction to social issues, minority groups, and
commercial [...] disorders and toward various social and family roles. The
form was administered to 200 members of an opinion panel, a group of
individuals [...] resemble the whole United States population [...]. Scale
profiles for [...] found that individuals
with a mental disorder are viewed as relatively bad, weak, and
inactive, in comparison to normal people. The scale which best differentiates
public attitudes toward mental disorders from attitudes toward [...]

* Unpublished research.


FIGURE 17-3. Semantic Differential profiles for the concepts Me (-), Old [...]
(...), and Neurotic Man (..-). Each point represents the mean rating of 2[...]
subjects.

[Profile plot over the following scales, left pole to right pole:]

Bad             Good
Worthless       Valuable
Dirty           Clean
Insincere       Sincere
Foolish         Wise
Ignorant        Intelligent
Dangerous       Safe
Sad             Happy
Poor            Rich
Unpredictable   Predictable
Tense           Relaxed
Sick            Healthy
Weak            Strong
Delicate        Rugged
Passive         Active
Slow            Fast
Cold            Warm


Evaluation of the Semantic Differential. One of the major questions to
ask about the Semantic Differential is, “Does it measure meaning?” The
concepts father and mother receive similar ratings by most subjects, with
some consistent differences. Does this imply that father and mother mean
the same thing to most subjects? Of course, fathers and mothers are not
completely similar in all the ways that the term meaning can be construed.
But in another sense they are much the same, providing love,
guidance, security, and also discipline. It is best said that the Semantic
Differential measures some components of connotative meaning rather
than meaning in all of its aspects. It measures the implications of objects
rather than the total perceptual reaction to objects. Another way of
explaining it is to state that the Semantic Differential is concerned with
metaphorical meaning.

Some people wonder whether sensible ratings can be made with some
of the scales that are applied to particular concepts. For example, what
sense does it make for an individual to rate the Empire State Building on
the scale hot-cold or on the scale friendly-unfriendly? There is no concrete
way in which subjects can be told the basis for making such ratings, but
practical experience has shown that they can do it anyway. A very 
detailed series of reports on reliability (cf. 19, chap. 3) shows high reli- 
abilities for factor scores and even surprisingly high reliabilities for indi- 
vidual scales. 

Here in its purest form is exemplified the difficulty of validating an 
experimental technique. There is little else with which to measure con- 
notative meaning and certainly no more dependable index with which 
Semantic Differential scores can be correlated. In cases of this kind the 
only appeal is to construct validity as it was discussed previously. The 
many researches that are being undertaken with the instrument
and the many relationships which are being shown with other variables
are gradually demonstrating what the instrument can and cannot measure.
Considering the wide use of the instrument and the string of
interesting research results, the Semantic Differential gives real promise as a
measure of connotative meaning. 


SIMILARITY JUDGMENTS 


A number of rating techniques (see 12) have been developed
that depend entirely on judgments of similarity among
persons, objects, and [...]. The [...] procedure of this
kind is the method of triads. In a typical study the problem is to learn
the relationships among a dozen stimuli, which may be anything from
the names of persons to [...]. The stimuli are presented to
the subject three at a time, and he is asked to judge which two are more
similar. Next, the subject is asked which two of the stimuli are most
different. The subject is presented in turn with all possible triads of the
stimuli, or he is presented selectively with as many triads as are necessary
to ensure that each stimulus appears at least once with each of the others.
The analysis of similarity judgments depends on the percentages of
times that objects are judged to be similar to or different from one
another. The percentages can be obtained either by repeating the whole
set of triads many times with the same person or, more typically, by
calculating them over the responses of a group of subjects. Statistical
procedures are available (see 1, 13, 21, 28) for determining the positions
of the experimental stimuli on the continua which underlie the judgments.
A number of other methods of similarity judgment (see 12, pp. 246-251)
provide much the same results as the method of triads.

Perhaps the best illustration of similarity judgments can be given with 
a type of study which, to the author’s knowledge, has not yet been under- 
taken. The problem is to learn how a group of industrialists in a particular 
city judge members of the United States Senate. Twenty senators are 
selected in such a way as to cut broadly across what are thought to be 
dimensions of political and governmental philosophy. The name of each 
senator is typed separately on a file card. One hundred industrialists are
used as subjects. Each is required to make similarity judgments by the
method of triads. The responses of all subjects are placed together in a
table, showing the per cent of times that senator A is judged to be more 
similar to B than to C, and so on. 
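
The tabulation just described can be sketched in a few lines of Python. This is only an illustrative sketch, not a procedure taken from the references: the function name `similarity_proportions`, the data layout, and the three sample responses are all hypothetical, and each judgment is assumed to record which pair within a triad was called most similar.

```python
from collections import Counter
from itertools import combinations

def similarity_proportions(judgments):
    """Proportion of times each pair is judged the most similar pair.

    judgments: iterable of (triad, chosen_pair), where triad holds three
    stimulus labels and chosen_pair holds the two labels judged most
    similar within that triad (hypothetical data layout).
    """
    presented = Counter()  # times a pair appeared together in a triad
    chosen = Counter()     # times the pair was picked as most similar
    for triad, pair in judgments:
        for p in combinations(sorted(triad), 2):
            presented[p] += 1
        chosen[tuple(sorted(pair))] += 1
    return {p: chosen[p] / presented[p] for p in presented}

# Hypothetical responses of three subjects to the triad of senators A, B, C.
judgments = [
    (("A", "B", "C"), ("A", "B")),
    (("A", "B", "C"), ("A", "B")),
    (("A", "B", "C"), ("B", "C")),
]
props = similarity_proportions(judgments)
```

A table of such proportions over all presented triads would be the raw material for the scaling procedures cited in the text; the scaling itself is not attempted here.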

The data can be used to deduce a number of things about how the 
industrialists view the Senators. Primarily it can be determined which 
Senators are viewed as being more similar in their political behavior. 
Secondly, an analysis will indicate the number of underlying dimensions 
along which the Senators are judged. For example, it might be apparent 
that one underlying continuum concerns foreign policy, a second concerns
fiscal policy in the United States, and a third concerns social welfare legis- 
lation. The analysis would provide a means of locating each of the 20 sena- 
tors on the three dimensions. In general, the study should provide a better 
understanding of how Senators are judged by the particular group of 
subjects. 

Evaluation of Similarity Judgments. One obvious feature of similarity
judgments is that they are very laborious. If there are 20 stimuli as in the
example above, there are 1,140 possible triads. If the subject makes even
a fraction of these judgments, the gathering of data is time consuming.
The subsequent task of statistical analysis is difficult and still somewhat
controversial.
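
The figure of 1,140 is the number of ways of choosing 3 stimuli from 20 without regard to order. As a worked check:

```latex
\binom{20}{3} = \frac{20 \cdot 19 \cdot 18}{3 \cdot 2 \cdot 1} = \frac{6840}{6} = 1140
```

By the same reasoning, the dozen stimuli of a typical triads study would already yield 220 possible triads.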


It may help to understand the method of similarity judgments by com- 
paring it with the Q-sort and Semantic Differential. The Q-sort could be 
applied to the hypothetical problem above by having each industrialist 
describe each senator with, say, a sample of 50 items. The items would
be such as “favors more financial aid to England,” “supports high tariffs,”
and “wants more federal aid to education.” The Semantic Differential
would use each of the senators as a concept with scales like strong-weak,
liberal-conservative, wise-foolish, kind-cruel, and safe-dangerous.

The judging of objects in terms of similarity is both an advantage and a
disadvantage. Similarity is a primitive idea that is difficult to explain to
subjects, and it is easy to imagine experiments in which judgments of
similarity would be hard to make. The use of similarity judgments also
provides little information for interpreting the results. It can be
determined that certain objects or concepts are judged as similar but an
interpretation must be made of the underlying characteristic which makes for
similarity. In both the Q-sort and the Semantic Differential the respective
items and scales indicate more directly the way in which things are viewed
as similar. On the other hand there are advantages to the similarity
concept in ratings. It is not dependent on any set of items or scales, and thus
the results often have a broader generality than those found with the 
Q-sort or Semantic Differential. Similarity judgments have to this time 
and will likely continue to have the most value in dealing with stimuli for 
which items or scales are very difficult to devise. The major studies to date 
have largely dealt with perceptual objects, geometrical forms, colors, and 
stimuli of that kind. 


TECHNIQUES FOR STUDYING INTERPERSONAL RELATIONS


Much of psychology and particularly that branch called social psy- 
chology is concerned with the relationships between people and the ways 
that people behave in group situations. Because of the complexity of 
human interactions and the difficulties of measuring interpersonal re- 
sponses, studies in this area have lagged behind most other branches of 
psychological investigation. This section will consider some of the more 
promising inroads into the measurement of group and interpersonal
behavior.

Sociometry. One of the earliest and most widely used methods for
understanding relationships within a group is based on the choices of
group members for participation with one another. The method can be
illustrated with a hypothetical situation in which 12 high school students
are being divided into subgroups for work on a class project. The first
step would be to ask each of the students to note, in private, the names of
several persons with whom he would rather work. The choices can then
be plotted as a diagram, or a sociogram as it is often called. This is done
by drawing arrows from person to person showing the choices. Figure
17-4 shows a typical situation in which each student is asked to give his
first and second choice for a work partner.

The sociogram contains no new information over what is known from
the choices of individuals for one another, but it does provide a way of
picturing and understanding the groupings of people. Looking at the
sociogram in Figure 17-4, we find that persons 5, 8, 11, and 12 form a 
group, or clique. All of the first choices and all but one of the second 
choices stay within the group. Person 12 is especially popular in the
group, receiving the first-choice nominations of the other three members.
Persons 1, 6, 7, and 9 form a second group. Persons 2, 4, and 10 form a
completely closed group, in which all choices remain among the three
persons. Person 3 is an isolate, and for that reason might have difficulty
in working with any of the individuals. Persons 6 and 11 serve as
intermediaries between their respective groups. Their relationship might be
used to get better communication between the two groups or help in
getting them together on mutual projects. Relationships of the kinds
shown in the sociogram would prove useful in dividing the class into
subgroups, either placing individuals into the natural groups which they
form or improving relationships between certain individuals to allow new
groupings of the students.

FIGURE 17-4. Sociogram showing the choices for work partners in a group of 12
students. Solid arrows show first choices; broken arrows show second choices.
(Diagram not reproduced.)

The sociogram technique can be applied to a wide variety of inter- 
actions including systems of leadership and communications. For exam- 
ple, in a study of military leadership, it would be expected that a hier- 
archy would be formed in which each individual receives his orders from 
someone above him, leading to the officer who is in over-all command of
an operation.

In addition to the simple pictographic inspection of the sociogram, 
there are a number of indices that can be applied to sociometric data 
(see 20). In the case where each individual makes only one choice, an 
estimate of an individual’s popularity in the group can be obtained as 
follows. 

$$\text{Choice status of } P_i = \frac{\text{no. of persons choosing } P_i}{N - 1}$$

where $P_i$ stands for person i
      $N$ stands for the number of persons in the group


Because the individual is not allowed to choose himself, the denominator 
of the fraction is N — 1 rather than N. 

If individuals are allowed to choose as many persons in the group as 
they like, an index of over-all liking of the group members for one another, 
or what has been called “positive expansiveness,” is obtained as follows. 


$$\text{Positive expansiveness of } P_i = \frac{\text{no. choices } P_i \text{ makes}}{N - 1}$$


Indices are also available (see 20) for describing the group as a whole.
The following formula gives an indication of the extent to which all of
the group members are incorporated, or integrated, into the group.

$$\text{Group integration} = \frac{1}{\text{no. of isolates (individuals receiving no choices)}}$$
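
The three indices above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary layout (each person mapped to the set of persons he chooses), and the five-person group are hypothetical, and returning `None` when there are no isolates is a choice made here because the formula is then undefined. The text gives the choice-status index for the case of one choice per person; applying it to the multiple-choice data below is likewise an assumption.

```python
def choice_status(choices, person):
    """Number of persons choosing `person`, divided by N - 1."""
    n = len(choices)
    received = sum(person in chosen
                   for chooser, chosen in choices.items()
                   if chooser != person)
    return received / (n - 1)

def positive_expansiveness(choices, person):
    """Number of choices `person` makes, divided by N - 1."""
    return len(choices[person]) / (len(choices) - 1)

def group_integration(choices):
    """1 / number of isolates; an isolate receives no choices."""
    chosen_by_someone = set()
    for chooser, chosen in choices.items():
        chosen_by_someone.update(c for c in chosen if c != chooser)
    isolates = [p for p in choices if p not in chosen_by_someone]
    # Assumption: with no isolates the index is undefined, so return None
    # rather than dividing by zero.
    return 1 / len(isolates) if isolates else None

# Hypothetical group of five persons; each key lists the persons it chooses.
choices = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"B"},
    "E": {"A", "B", "C"},
}
```

In this made-up group, B is chosen by all four of the others, while D and E receive no choices and so count as the isolates.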


The most important use of sociometry and sociometric analysis is in
measuring experimental changes in group structures. For example, it
might be hypothesized that a certain type of leadership will alter the
interrelationships of group members in a particular manner. An
experiment is undertaken in which the special type of leadership is instituted.
Sociometric analysis is made before and after the experiment to measure
the change in group structure.
Interpersonal Perception. It is a truism that much of human
behavior is determined by how the individual regards other people, how
well he understands the feelings of others, and the contrast between what
the individual thinks of himself and of other individuals. Empathy, or
the ability to understand others, is a variable that has been studied
extensively in this regard. In the psychotherapy situation it would be expected
that the effective therapist could understand well how the patient thinks
and feels. It might also be expected that adjustment in [...] would
depend in part on how well the individuals understand one another’s
thoughts and feelings. A suggested measure of empathy is obtained by
having an individual judge how another person will describe himself.
This can be done with interest tests, self-report inventories, the Q-sort,
the Semantic Differential, or other methods of personality description.
The judge fills out the form in the way that he thinks the person to be
predicted will fill out the form. The judge’s marks are then compared with
the actual ratings of the other person. If the two sets of marks correspond 
closely, it is said that the judge has high empathy, at least for the one 
person being judged. A statistical comparison of the two sets of ratings 
can be made either with the correlation coefficient or, better, with the 
D statistic discussed in Chapter 7. 
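
The comparison of a judge's predictions with a person's actual ratings can be sketched as follows. The definition of D is given in Chapter 7 and is not repeated in this section, so the sketch assumes the usual form of the statistic, the square root of the sum of squared differences between corresponding ratings; the function names and the sample ratings are hypothetical. One plausible reason for preferring D is visible in the made-up data: a judge who is off by a constant amount on every scale still correlates perfectly with the target, whereas D registers the discrepancy.

```python
import math

def d_statistic(profile_a, profile_b):
    """Distance between two rating profiles (assumed form of D)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of ratings")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

def correlation(profile_a, profile_b):
    """Pearson product-moment correlation between two rating profiles."""
    n = len(profile_a)
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical seven-point ratings on five scales.
self_ratings = [6, 5, 7, 4, 6]      # how the person describes himself
judge_prediction = [4, 3, 5, 2, 4]  # the judge's guess, 2 points lower throughout

d = d_statistic(self_ratings, judge_prediction)  # reflects the level difference
r = correlation(self_ratings, judge_prediction)  # unaffected by the level difference
```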

In addition to measures of empathy, techniques have been proposed for
measuring a number of other aspects of interpersonal perception. Fiedler
(7, 8) has performed extensive investigations with a measure named
“assumed similarity of opposites” (ASO). The subject is asked to rate two
persons with the Semantic Differential: one individual whom he likes or
with whom he works well and another individual whom he dislikes or
with whom he works poorly. Fiedler employs the D statistic between the
two sets of ratings as a measure of ASO. The larger the D, the more
dissimilar are judged the high-preference and low-preference persons.

Fiedler (8) found a relationship between the ASO scores of pilots and 
the effectiveness of bomber crews and between the ASO scores of tank 
commanders and the effectiveness of tank crews. Measures of effectiveness 
consisted of expertness in bombing and tank gunnery respectively. Each 
leader was asked to predict the self-ratings of an individual with whom 
he worked poorly and the self-ratings of an individual with whom he 
worked well. Fiedler found that the leaders of effective crews tended to 
regard their most and least preferred coworkers as relatively dissimilar. 

Measures of interpersonal perception have not yet met with substantial 
success. Many different kinds of rating forms and questionnaires have 
been used in the studies, and it would not be expected that all would 
produce the same results. It is unreasonable to think that an individual’s
ability to judge the ratings of another person on, say, an interest
inventory will be the same as the ability to judge another person’s ratings on
the Semantic Differential. As is true in all measures of similarity, it is
meaningless to speak of similarity between persons or between ratings in
general. There is a current trend toward the use of factored tests rather
than the use of a miscellaneous collection of items. This will permit a
more concrete statement of the ways in which sets of ratings are similar 
and will permit a more direct comparison of the results obtained by 
different investigators. 

Measures of interpersonal perception meet with a number of statistical
and conceptual difficulties. There has been considerable dispute (see 4)
regarding the proper statistical procedures for analyzing the research
data. Although the D statistic is presently the most logical and the most
widely used, the conglomerate nature of the statistic makes the results
somewhat difficult to interpret in many studies. One of the conceptual
difficulties is that some of the measures of interpersonal perception, such
as the ASO measure, are so complicated that even if significant results are
obtained, it is hard to understand what they mean. Another difficulty in
the measurement of interpersonal perception is the confusion of accuracy
in the understanding of others with the individual’s knowledge of general
social stereotypes. The “acceptability influence” was discussed in
Chapter 14, where it was said that there is a tendency for all persons to give
similar and culturally favored ratings of themselves. Therefore, what may
appear to be accuracy in the understanding of specific persons may be
only a knowledge of cultural stereotypes, a knowledge of how people in
general rate themselves. Current methodological and empirical
investigations (see 3, 11) are endeavoring to clean up the statistical and conceptual
ambiguities in this area, and it is hoped that measures of interpersonal
perception will eventually be of substantial use.


MEASUREMENT TECHNIQUES FOR COMMUNICATIONS STUDIES 


The field of human communications is rapidly developing as a meeting 
ground for diverse scientific disciplines and research techniques. Broadly 
considered, much of human behavior can be subsumed under the term 
communications. The basic ingredients in a communications system are 
a source, a person or device that determines the message; a set of symbols
or signs into which the message is encoded; a method of transmission; and
a receiver, either human or mechanical, that takes in and decodes or gives
meaning to the message. The system of events takes place in ordinary
conversation when first an individual has an idea; the idea is formulated
in terms of verbal language; the sounds carry to the other person’s ears;
and the human brain acts as the decoding mechanism.

Each step in the communications process offers many opportunities for
research, and many of the instruments discussed earlier in this chapter
and in previous chapters can be applied. A vocabulary test measures, in
part, the range of language signs in the individual repertoire.
Achievement tests can be used to measure the differential effectiveness of
different ways of encoding and transmitting messages. For example, it would
be useful to know whether college classes conducted via television
promote the same amounts and kinds of learning as are obtained from
ordinary classroom participation. Traditional methods for measuring
attitudes are useful in learning the persuasive effectiveness of
messages and media of communication.

One of the major stumbling blocks in communications studies is the
lack of effective methods for the description of written and spoken
messages. For example, if it is found that a particular type of written
message is effective in either conveying information or changing people’s
attitudes, the finding is of no general use unless the message itself can be
concretely described. An adequate description must consider some of the
important message characteristics, and only a scanty knowledge is
presently available about the important ways in which messages vary.

“Message anxiety” is a characteristic which is receiving wide attention
in communications studies. A communication is said to have message
anxiety if it tends to frighten the audience, uses inflammatory statements,
and goes into detail about the dire consequences of particular events. At
present, characteristics like message anxiety rest largely on intuitive
grounds. It is difficult to state explicitly when a communication has
message anxiety or any of the other characteristics being used in
communications studies. Even the few characteristics of messages that are presently
available must be documented intuitively on a “more and less” basis 
rather than measured directly. It can be expected that further develop- 
ments in the field of communications will depend heavily on the invention 
of measures of message content. The following sections discuss some of 
the measurement methods currently under investigation. 

Readability. One of the primary needs is a measure of the simplicity 
and ease with which messages can be understood. It is common knowl- 
edge that there are different ways of writing the same set of facts, and that 
some ways are more conducive to the reader’s understanding than others.
A number of formulas have been developed (see 5, 9) which purport to 
measure the readability of written messages. The formulas mainly depend 
on counts of the length of words, either in letters or syllables, the fre- 
quency with which words occur in English writing, and the length of 
sentences. The following example shows two ways of stating essentially 
the same facts, the first sentence having low readability and the second 
high readability: 

The primary consequence of protracted cerebration on profound and 
enigmatic problems, if not interspersed systematically with less involved, 
quiescent diversions, is a reaction of tedium.

Thinking for a long period of time about difficult problems will make 
you feel tired. 

Numerous examples have been shown (see 5, 9) in which the formulas 
correlate well with intuitive estimates of readability (see Table 17-2). 
However, the concept of readability is difficult to define and somewhat 
misleading. It should not be assumed that readability as measured by 
current counting methods necessarily means good writing. Scientific 
articles are necessarily less readable in terms of readability formulas. In 
scientific writing, complex words must often be used to convey complex 
thoughts. “Mary had a little lamb” is more readable in terms of the
readability formulas, but it is not likely a better poem than Gray’s “Elegy.”

A message is readable in terms of the readability formulas if it is 
simple and can be understood easily even by persons with relatively little 
education. The readability formulas are most important when the message 
is intended to convey specific information to a wide audience. They would 
be useful, for example, in constructing government pamphlets for the 
general public, dealing with child care, farming methods, safety precau- 
tions, and the like. 
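
To make the counting idea concrete, the following sketch computes a reading ease score from the two counts reported in Table 17-2. The section does not print the formula itself; the constants used here are those of the Flesch Reading Ease formula as it is commonly published, and they should be checked against reference 9 before being relied on. The function name is hypothetical, and the three input rows are taken from Table 17-2.

```python
def flesch_reading_ease(syllables_per_100_words, words_per_sentence):
    """Reading ease score; higher scores mean easier reading.

    Constants are those of the commonly published Flesch Reading Ease
    formula (an assumption here, since the section does not print it).
    """
    return (206.835
            - 0.846 * syllables_per_100_words
            - 1.015 * words_per_sentence)

# (description, syllables per 100 words, average sentence length)
rows = [
    ("Very easy", 123, 8),
    ("Standard", 147, 17),
    ("Very difficult", 192, 29),
]
scores = {name: round(flesch_reading_ease(s, w), 1) for name, s, w in rows}
```

Under these constants the three rows score roughly 95, 65, and 15, which fall in the expected order from easiest to hardest.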


TABLE 17-2. READABILITY SCORES FOR VARIOUS KINDS OF WRITTEN MATERIAL
(From Flesch, 9, p. 6)

Reading      Description                                   Syllables per  Av. sentence
ease score   of style          Typical magazine            100 words      length
90-100       Very easy         Comics                      123            8
80-90        Easy              Pulp fiction                131            11
70-80        Fairly easy       Slick fiction               139            14
60-70        Standard          Digests, Time, mass
                               nonfiction                  147            17
50-60        Fairly difficult  Harper’s, Atlantic          155            21
30-50        Difficult         Academic, scholarly         167            25
0-30         Very difficult    Scientific, professional    192            29


Cloze Procedure. The incomplete sentence is an old device for testing
the intelligence of children. Taylor (see 24, 25) has developed a similar
procedure for the study, not of individual differences among subjects, but
of differences among messages. His method begins with the deletion of a
certain number of words from the written material. This is done either by
deleting words randomly or by removing every nth word.
(A series of empirical studies * suggests that a good approach is to delete
every fifth word in a message.) The mutilated messages are then given
to a group of subjects. They are told to fill in the blanks as best they can.
The score for each person is the number of correct words placed in the
blanks. The score for the message is the mean number of correct words
obtained by the subjects.

The two messages below illustrate the use of Cloze procedure.
The first is a low Cloze score (“hard”) message and the second is a
high Cloze score (“easy”) message:

During the conjugation of ________ coli in the chimpanzee ________
micronucleus undergoes two maturation ________ to form
four nuclei, ________ which three degenerate and ________ fourth
divides to produce ________ stationary and migratory pronuclei.
________ the exchange of nuclei, ________ exchanged nuclei fuse
and ________ divide to form a ________ and a large nucleus.

The boy picked apples ________ a tree in the ________. He gave
them to ________ farmer’s wife. She baked ________ apple pie for
the ________. He ate two big ________. Then the boy helped
________ farmer milk the cow.

* Personal communication from W. L. Taylor.
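
The steps of the procedure (deleting every nth word, scoring each subject, and averaging over subjects) can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, exact-word matching that ignores case and trailing punctuation is an assumption about what counts as a "correct" word, and the two subjects' answers are invented. The sample sentence is the high-readability example quoted earlier in the chapter.

```python
def make_cloze(text, n=5, blank="________"):
    """Delete every nth word; return the mutilated text and the answer key."""
    mutilated, key = [], []
    for position, word in enumerate(text.split(), start=1):
        if position % n == 0:
            key.append(word)
            mutilated.append(blank)
        else:
            mutilated.append(word)
    return " ".join(mutilated), key

def cloze_score(key, answers):
    """Number of blanks filled with the deleted word (case-insensitive)."""
    return sum(k.strip(".,").lower() == a.strip(".,").lower()
               for k, a in zip(key, answers))

def message_score(key, all_answers):
    """Mean number of correct words over a group of subjects."""
    return sum(cloze_score(key, a) for a in all_answers) / len(all_answers)

text = ("Thinking for a long period of time about difficult problems "
        "will make you feel tired.")
mutilated, key = make_cloze(text)

# Invented answers from two hypothetical subjects.
all_answers = [
    ["period", "problems", "tired"],
    ["time", "things", "tired"],
]
mean_correct = message_score(key, all_answers)
```

Because the blank replaces the whole token, punctuation attached to a deleted word disappears from the mutilated text; a fuller implementation would probably preserve it.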


Taylor considers the Cloze procedure to be a new approach to the 
measurement of readability, which he regards as having certain advan- 
tages over older methods. Suci and Nunnally * found correlations of .42 
and .48 of Cloze scores with Flesch (9) readability scores and Dale-Chall 
(5) readability scores respectively over a diverse sample of 70 messages. 
This shows that Cloze is related to conventional readability formulas; but 
the correlations are far enough from perfect to raise the question as to 
what additional variable or variables are being measured by the Cloze 
procedure.

Regardless of whether or not Cloze procedure proves to be a useful 
approach to the measurement of readability, it has interesting properties 
in its own right. Cloze seems to measure the psychological redundancy, 
or repetitiousness, in written messages. If everyone can correctly fill in a
deleted word, the word is completely redundant in the message. This is a 
way of saying that the particular word supplies no new information over 
the rest of the message. Cloze procedure and the concept of redundancy 
can be applied to auditory and visual patterns as well as to written mate- 
rial. Notes or melodic phrases can be either systematically deleted or 
deleted on a random basis from musical selections. Subjects can be asked 
to “fill in,” either hum or play, the missing parts. 

Taylor * is presently applying Cloze procedure to paintings. He sections
the painting into a large number of square areas. A fraction of the areas 
are removed on a random basis. Subjects are asked to fill in the missing 
forms and colors. By this method he hopes to find systematic differences 
among different styles of painting. 

Redundancy and its measurement by Cloze procedure might
eventually prove to be an interesting psychological variable. In some forms of
written material, redundancy is a desirable characteristic. This is so when
the purpose of the message is to “hammer in” a few important facts or
instructions. The interestingness of written messages, paintings, and
musical compositions might be dependent on optimal levels of redundancy.
That is, it may be that a certain amount of redundancy is necessary to
make a stimulus pattern pleasing. A musical composition in which each
succeeding note and pattern of notes is completely unpredictable from
previous notes sounds chaotic. On the other hand, a composition so
repetitious that each succeeding note and phrase can be anticipated with
very high accuracy proves to be dull and uninteresting to the
sophisticated ear.

* Unpublished research.
* Personal communication from W. L. Taylor.

Content Analysis. A simple way of describing the communication
process is to say that it concerns how who says what to whom. Who and whom
are the source and audience in the communication process. The
readability formulas and Cloze procedure are concerned with how the
message (what) is phrased. Although the term “content analysis” is often
used to speak in general about methods of analyzing the properties of
messages, it will be used more specifically here to refer to the information
or meanings which the message embodies.

Content analysis came into prominence during the first twenty years of 
this century as a means of studying the content of American newspapers. 
Journalists were interested in the subject-matter coverages which provoked
the most public interest. Sociologists were interested in the influence
of newspaper content on the changing political and legal forces on
the American scene. Content analyses during this century ranged into 
the hundreds (see 2) and varied from the “dominant images” in 
Shakespearian plays to the analysis of propaganda during the Second 
World War. 


TABLE 17-3. CONTENT ANALYSIS OF NEWSPAPER TREATMENT
OF A MURDER TRIAL
(Adapted from Berelson, 2, p. 41)

                              Morning Sun,  News-Post,  Evening Sun,  Afro-American,
                              per cent      per cent    per cent      per cent
Helpful to James’s case       0             4           10            [illegible]
Destructive to James’s case   34            46          40            [illegible]
Neutral                       66            50          50            [illegible]

The results of a typical content analysis are shown in Table 17-3. The
analysis concerned the differential treatment by a number of newspapers
of a Negro, accused of murder, before his trial. The content of the four
papers was searched for material relevant to the case. Judgments were
made about the relative favorableness or unfavorableness of each item.
The numbers of “helpful,” “destructive,” and “neutral” materials were then
transformed to percentages. The results show, among other things, that
the Morning Sun had more space devoted to “neutral” content but nothing
that could be construed as “helpful” to the accused. The Afro-American
was the most one-sided of the four papers, the content being
predominantly “helpful” to the accused.

Table 17-4 shows an analysis of the literature of the Boy Scouts of
America and of Hitler Youth. The content was scored in terms of the ends 
that were recommended to the youths. The emphasis in Hitler Youth was 
on the development of the individual as a member of a national com- 
munity, whereas in the Boy Scouts the emphasis was on improving the 
individual as a person. 


TABLE 17-4. CONTENT ANALYSIS RESULTS OF THE GOALS RECOMMENDED IN THE
LITERATURE OF HITLER YOUTH AND THE BOY SCOUTS OF AMERICA
(Adapted from Berelson, 2, p. 38)

                                       Hitler Youth,  Boy Scouts,
Justification                          per cent       per cent
As a member of national community      66             25
As a member of face-to-face group      19             28
Essential for individual’s own
  satisfaction or perfection           15             47


A third illustrative content analysis is shown in Table 17-5. The prob- 
lem was to determine the amount of cooperation between the propaganda 
agencies of Italy and Germany. Content analyses were undertaken of the 
replies made by Radio Berlin and Radio Rome to a speech made by 
Roosevelt. Counts were made of the number of references to each coun- 
try, to the collaborating country, and to the joint effort of the two 
countries. The content of the two radio stations proved to be so different 
regarding self-references as to suggest strongly that little cooperation 
went on between the propaganda efforts of the two countries. 


TABLE 17-5. CONTENT ANALYSIS OF PROPAGANDA REFERENCES BY RADIO BERLIN
AND RADIO ROME IN REPLY TO THE ROOSEVELT SPEECH OF SEPTEMBER 11, 1941
(From Berelson, 2, p. 73)

Self-references in reply   Radio Berlin,  Radio Rome,
to Roosevelt speech        per cent       per cent
Germany                    96             22
Italy                      0              1
Axis                       4              17

Measurement methods for content analysis have been dominated by
counting techniques. The experimenter seeks something to count that will
have a bearing on a more general issue. In each of the three examples
above, something was counted as a way of getting at a broader problem.
In some studies the simple counting of things supplies all the information
that is needed, such as would be the case in learning the different amounts
of newspaper space or radio time devoted to news, sports, drama, and
advertisement. In other cases, the issue being studied is much more
complex, and the units which are counted are only tangentially related to
what will be inferred from the results.
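
The counting step common to the three examples, tallying coded items and converting the tallies to percentages, can be sketched as follows. The function name and the ten coded items are hypothetical and are not data from any of the studies cited; the category labels simply echo those of Table 17-3.

```python
from collections import Counter

def category_percentages(codes):
    """Convert a list of coder judgments into per cent of items per category."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {category: 100 * count / total
            for category, count in counts.items()}

# Hypothetical codes assigned to ten items drawn from one newspaper.
codes = ["neutral"] * 5 + ["destructive"] * 4 + ["helpful"] * 1
percentages = category_percentages(codes)
```

The arithmetic is trivial; as the text goes on to note, the difficult parts of a content analysis are deciding what to count, sampling the messages, and training the coders.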

After a decision is made as to what is to be counted or observed in
message content, a number of steps must be undertaken to effect a
systematic content analysis. If the content analysis is to describe the
treatment of a topic in a broad range of messages rather than in several
particular messages, a sampling of messages and message sources must be
undertaken. If, for example, the problem is to study the different amounts
of time given to different subject-matter categories in United States
television programs, a broad sample of programs from United States
television networks must be analyzed. In addition it would be necessary
either to draw randomly or vary systematically the number of programs
analyzed during different periods of the day. A thorough sampling of this
kind is a large-scale operation.
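
One way to vary the sample systematically over periods of the day is to
draw the same number of programs at random from each period. The sketch
below assumes a hypothetical program listing in which each program is
tagged with a period; the names `stratified_sample` and `listing` are
illustrative and are not taken from any study cited here.

```python
import random


def stratified_sample(programs, per_period, seed=0):
    """Draw up to `per_period` programs at random from each period of the day."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    by_period = {}
    for title, period in programs:
        by_period.setdefault(period, []).append(title)
    return {
        period: rng.sample(titles, min(per_period, len(titles)))
        for period, titles in by_period.items()
    }


# Hypothetical listing: (program title, period of day).
listing = [
    (f"program {i}", period)
    for period in ("morning", "afternoon", "evening")
    for i in range(10)
]
sample = stratified_sample(listing, per_period=3)
```

Drawing within each period guarantees that no part of the broadcast day
is left out of the analysis by chance.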

After messages are chosen for a content analysis, judges, usually 
referred to as coders, must be selected and trained for the work. The type 
of coders chosen depends on the purpose of the study. For example, if the 
analysis concerns how housewives react to romance magazines, the coders 
would necessarily be drawn from a representative cross section of “home- 
makers.” If the work of the coders is concerned with analyzing what is in 
the message rather than reacting to the message, the coders are likely to 
be sophisticated persons who have some training in communications
studies.


A third step in the organization of a content analysis is the training of
coders. This is a relatively easy job if a simple count is being made
of readily observed content characteristics. If the coders must make
complex judgments about the characteristics of content, a lengthy
training period may be required. For example, in a recent study in which
the author participated (16), coders were required to judge the presence
or absence of 10 factors concerning beliefs about mental health as
portrayed in radio, television, newspapers, and magazines. It was
necessary to use psychology majors as coders, and training was
necessary to teach the meanings and implications of the factors.

One essential step before the content analysis is undertaken is to
determine the reliability of coders. This can be done quite simply by
comparing the judgments or counts made by several coders on the same
material. If the
coders agree on, say, only 25 per cent of their judgments, either the study 
should be abandoned or additional training should be given the coders. 
Unless the judgments or counts are so simple as to be agreed upon almost 
perfectly by all coders, it is wise to use either two or several coders for 
each message that is analyzed. As is usually the case in the averaging of 
ratings or judgments, a much more reliable set of results will be obtained
by the use of more than one coder. 
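
A simple check of coder reliability is the percentage of judgments on
which two coders agree, averaged over all pairs of coders. The sketch
below uses hypothetical judgments of the presence or absence of one
factor in four messages. Simple percent agreement is only one possible
index; it makes no allowance for agreement that would occur by chance.

```python
from itertools import combinations


def percent_agreement(coder_a, coder_b):
    """Percentage of messages on which two coders gave the same judgment."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


def mean_pairwise_agreement(codings):
    """Average percent agreement over every pair of coders."""
    pairs = list(combinations(codings, 2))
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical judgments by three coders on the same four messages.
codings = [
    ["present", "absent", "present", "present"],
    ["present", "absent", "absent", "present"],
    ["present", "present", "absent", "present"],
]
overall = mean_pairwise_agreement(codings)
print(f"Mean agreement: {overall:.1f} per cent")
```

A low figure on a trial set of messages is the signal, as noted above,
either to retrain the coders or to reconsider the study.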

The major snag in the use of content analysis comes when the question 
being studied is complicated to the point where simple counts or judg-
ments cannot be used. To illustrate this point, consider the following
excerpt from the analyses of Wolfenstein and Leites (29) of the charac- 
ters and themes in American motion pictures: 


It is probably a characteristic American conviction that suffering is pointless 
and unnecessary. In keeping with this, love in American films rarely involves 
suffering. . . . Dangers in American films tend to take the form of external
violence, not inner conflict. (29, pp. 94, 99)


Insightful and interesting as are such analyses, it is difficult to see how 
they could be substantiated by measurements or counts of message 
characteristics. 

Content analysis is essentially what is intended in many of the instru- 
ments of clinical psychology. Projective tests, particularly the TAT, are 
concerned with message content and the inferences which it provides 
about the individual's personality. Here, as in numerous other content 
analyses, the questions under consideration are complicated beyond the 
use of simple counts and judgments about content characteristics. New 
and more powerful methods of content analysis are badly needed in many 
areas of research; but for some time to come it can be expected that 
many issues about message content will be studied by an intuitive 
approach rather than with explicit, standardized techniques. 
CHAPTER 18


Establishment of Testing Programs 


So far in this book, psychological measurement has been discussed in 
the abstract, without considering the real-life world in which measure- 
ment methods must be applied. The best efforts of the psychological lab- 
oratory can be thwarted by the unknowing or unsympathetic test user, 
and the instruments which are founded on the best of theory may be 
impractical and too expensive for general use. The person who is well 
grounded in the principles of test construction may lack the personality 
and the know-how needed to educate the test consumer, establish con- 
genial relations with administrative officials, and organize a testing pro- 
gram. The purpose of this chapter is to discuss some guideposts for 
applying psychological measurement methods. 

The Measurement Specialist in an Applied Setting. Large-scale testing 
programs are employed most often in vocational guidance, psychological 
counseling, school programs, and personnel selection. Special problems 
encountered in each of these settings will be discussed in turn. Regard- 
less of where the measurement specialist applies his skills, there are some 
general principles that he should follow in order to establish friendly 
relations. 

It is not uncommon for the measurement specialist to meet with consid- 
erable suspicion in the applied setting. The plant manager, for example, 
might regard the introduction of a testing program as a challenge to his 
authority to select and supervise personnel. The workers may fear that
tests will be used to deprive them of their jobs or to shift them to less 
desired tasks. School teachers often react negatively to disruptions of 
schedule by psychological testing. The measurement specialist applies 
esoteric procedures like multiple correlation, item analysis, and factor 
analysis. These procedures tend to create difficulties in communicating 
research findings, and administrators often feel insecure in the face of 
statistical mumbo jumbo. 

The first requirement for getting along well in an applied setting is to 
communicate frankly about the methods and purposes of testing. It should 
be made clear that the measurement specialist intends to fit in with 
existing programs and work toward the goals of the group. The testing
program is intended to supply administrators with information which can
then be used in behalf of their own goals. Eventually, the testing program
may have an impact on the goals of an organization and point to better
ways of doing things; but this is likely to come about only if administrators
understand the research evidence and are able to feel that they are not
being pushed into decisions.

Although the measurement specialist may take pride in teaching testing 
procedures to others, his responsibility is principally to speak the language 
of the organization with which he deals or in which he is employed. 
Statistical findings can usually be communicated without recourse to 
mathematical intricacies, by the use of simple graphs and diagrams. 
Rather than provide the organization with only statistical results, how- 
ever, the specialist should also be able to explain the practical implica- 
tions of measurement findings. 


One of the jobs of the measurement specialist is to explain to adminis- 
trative officials what tests can and cannot do. People in general tend to 
overestimate the power of psychological tests. Administrators will even- 
tually be disappointed in the results of a testing program if they think 
that psychological tests will be a cure-all for their personnel problems. 
The proper point of view is that psychological tests are far from perfect 
but that they are quite often better than any other method of evaluating 
people. 

Many administrative officials are not at all familiar with the con- 
struction and validation of tests. It is sometimes assumed that the 
psychologist can make up a test on his own without studying particular 
jobs or performing research. Consequently, administrative officials are 
sometimes surprised to learn that it will be necessary to interview per- 
sonnel (workmen, students, soldiers), give them tryout tests, and perform 
follow-up studies later. The administrator may be sincerely interested in
establishing a testing program but have little idea that it will require
rearrangement of schedules and the absence of men from the job or
classroom. These matters should be explained to administrators in advance
to avoid later sources of misunderstanding.

Confusion between tests of the predictor and the assessment type often
hinders the development of a testing program. The measurement specialist
is often asked to construct a predictor test when there is no assessment
available. For example, a request may be made for a selection test for
taxicab drivers. While discussing the problem with officials, the
specialist may learn either that no indices of job performance are
available or that the available indices are unreliable. If he then
proceeds to construct predictor tests, there will be no way to determine
their predictive efficiency. It is often necessary to explain that
predictor tests cannot be constructed and put in use before more
dependable assessments of job performance are obtained.

A second point of confusion involved between predictor and assessment
tests is the assumption that the measurement specialist can, completely
on his own, manufacture job or classroom performance assessments. Using
the example above, it would not be unexpected to receive a request for
tests to measure the efficiency of taxicab drivers who are presently on
the job. It must be explained that assessments concern human values, and
although the measurement specialist can help implement the values, the
values themselves must come from those who have the responsibility to
make decisions.

Above all, the measurement specialist must steer clear of factionalism 
and dispute within an organization. A testing program can be used puni-
tively to try to show, for example, that one administrator is more efficient
than another. The measurement specialist should retain his standing as a
scientist and gather facts in an impartial manner; otherwise he will lower 
his value as a research worker and his results will be trusted by few. 

Vocational Guidance. Vocational guidance may be incorporated in a 
school program, a psychological clinic, or in a separate government or 
commercial organization. People usually seek vocational guidance on 
their own or at the suggestion of an employer or teacher. The younger 
people come because they are unfamiliar with job opportunities and 
because they are not sure of their own aptitudes. Older people come 
because they are dissatisfied with their present positions or because their 
physical disabilities necessitate a change of jobs. 

A wide variety of tests is needed in a vocational guidance setting. It is 
particularly necessary to measure different aptitudes rather than to meas- 
ure general intelligence only. Two persons with the same over-all level of 
ability may have very different strong and weak points in particular apti- 
tude factors. Tests must also be available for the “special abilities,” such 
as the mechanical aptitudes. Interest tests are also widely used in voca- 
tional guidance, and they apparently serve their purpose well. Because of 
the variety of tests which must be employed with most subjects, it requires 
as much as one or two whole days to complete the testing and subsequent 
discussion of results. A thorough vocational guidance program is rather 
expensive to maintain. 

In addition to a thorough knowledge of psychological tests, the voca- 
tional counselor must have a wide acquaintance with job requirements 
and employment opportunities. It would, for example, be incorrect to sug- 
gest a particular vocation which would utilize an individual's interests and 
abilities if there is a severe shortage of jobs in that line. 

The counselor does not tell the individual what job or vocation he 
should enter. Tests are not sufficiently exact or all-inclusive to be able to 
make a decision for the person. Often extraneous factors outweigh the test
results. A person may appear best suited for a career as a painter or actor;
but if he has several dependents, the immediate need for a steady income
may necessitate the choice of another vocation. The needs of a person's
family, the values of his subgroup, and the desire to live in a certain
locality will all affect decisions about jobs and professional training.
The purpose of vocational guidance is to furnish the individual with
information about himself and about occupations which, when added to what
he already thinks and feels, will help him arrive at wise decisions.

Psychological Counseling. Here we will consider the use of tests to 
aid in making decisions about maladjusted persons. The setting may be a
psychological or psychiatric clinic or a mental hospital. The purpose is to 
diagnose the source of maladjustment and to decide the kinds of treat- 
ments that will do the most good. Whereas most of the instruments used 
in vocational guidance are group tests, individual tests predominate in 
clinical settings. Individual tests are used because the nature of the per- 
son's maladjustment may make it difficult to use group tests, and because 
the individual testing situation provides many opportunities to observe
the person. Tests that must be taken without assistance cannot be used
effectively with persons who are severely disturbed.

The opportunity to watch a person take an individual test is often as
rich a source of information as the test score. Two children may make
low scores on a test like the Stanford-Binet for very different reasons. One
child may try hard, in a slow, strained manner. Another child may give
wrong answers with a sly look in his eyes and intersperse his wrong
answers with highly intelligent remarks about the testing situation. Being
able to note differences of this kind is very helpful in detecting sources
of maladjustment.

The highest level of testing skill is required in the clinical setting. It
is usually more difficult to give an individual test, like the Rorschach,
to a disturbed person than to a normal individual. It may be difficult to
establish contact at all or to get the person to respond. Maladjusted
individuals will often be uncooperative. Clinical examiners must be
prepared for adults who get angry or begin to weep in the middle of the
testing session. A disturbed child may refuse to enter the testing room
or, if coaxed inside, hide under the table or tear up the test materials.
Disturbed individuals are not always so difficult, but the examiner must
be prepared for and capable of handling such situations as they arise.
The range of tests employed in a clinical setting is usually not as broad
as that required in vocational guidance. Reliance is placed principally on 
measures of general ability, particularly on the Stanford-Binet and the 
Wechsler scales, and on the Rorschach and TAT as measures of person- 
ality. Although there is occasional use of personality inventories, such as 
the Bernreuter and the MMPI, these are less favored than the projective 
instruments. 

Psychological counselors differ in the extent to which they employ psy- 
chological tests. Some counselors believe that the use of tests starts treat- 
ment off on the wrong foot. They feel that tests encourage the individual 
to think that the counselor is going to give him the “answers” to all his 
problems. Counselors with this point of view prefer to foster the indi- 
vidual’s independence of thought, work with him in solving his own prob- 
lems, rather than advise, manage, and direct. However, most counselors 
feel that tests are very useful in conjunction with treatment. Some coun- 
selors would prefer not to give and interpret the tests themselves, because 
they think that it will foster an unhealthy dependence. They would prefer 
to have the tests administered by some other psychologist and, to the 
extent that it is advisable, have the examiner explain the results to the 
individual. 

School Programs. The emphasis in testing programs changes from ele- 
mentary to advanced education. In elementary and high school, the test- 
ing program is usually tied directly to classroom instruction and the 
management of students. The measurement specialist is usually referred to 
as a school psychologist and usually has extensive training in educational 
methods. He helps compose examinations for particular courses, selects 
achievement tests, and performs diagnostic testing for those children who 
are meeting with difficulties. It is becoming popular to give all children 
a standard battery of tests, usually consisting of a group intelligence 
measure, interest inventories for older children, and sometimes attitude 
and personality inventories. 

Contrary to the procedure used in some other countries, there are no 
formal selection procedures in the United States to decide who can and 
who cannot enter grammar school and high school. Anyone can go on to
a higher grade if he passes the next lower one. Consequently, tests are 
very seldom used for selection in elementary and high school. Tests are 
becoming increasingly popular in educational research, to find better 
methods of instruction and better ways of organizing curricula. 

In the late years of high school and in college, one of the major func- 
tions of a testing program is to aid vocational guidance, which was dis- 
cussed in a previous section. Selection is one of the primary jobs in many 
college and university testing programs. Many private colleges and some 
state-supported institutions admit only those applicants who show promise
of doing well scholastically. In conjunction with high school grades,
psychological tests are the major selection instruments. Tests for select- 
ing college freshmen are used very widely and they serve the purpose 
well. The predictive efficiency of the tests can be measured straight- 
forwardly by correlating scores with grade point averages obtained later. 
Because the number of college applicants is steadily increasing beyond 
the facilities of educational institutions, it is expected that psychologi- 
cal tests will come even more prominently into play as selection instru- 
ments. 
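
The correlation mentioned above can be computed directly from paired
scores. The following sketch uses the ordinary product-moment formula;
the entrance test scores and grade point averages are hypothetical
figures invented for the illustration, not results from any study.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical entrance test scores and grade point averages earned later.
test_scores = [52, 61, 47, 70, 58, 66]
later_gpa = [2.4, 3.0, 2.1, 3.6, 2.7, 3.1]
validity = pearson_r(test_scores, later_gpa)
print(f"Validity coefficient: {validity:.2f}")
```

Six cases are far too few for a real validation study; the small sample
is used only to keep the arithmetic easy to follow.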

Personnel Selection and Management. Tests are being used widely in 
military, civil service, and industrial work to decide which persons will 
be given particular jobs or positions and to organize working conditions 
for maximum efficiency, safety, and pleasantness. Specialists who con- 
cern themselves with these matters are usually referred to as personnel 
psychologists. 

Some of the strategy of selecting people for particular jobs was dis- 
cussed in Chapter 5. The efficiency of selection depends not only on how 
well predictor tests correlate with the assessment of job performance but 
on the selection ratio and the success ratio as well. Even the most valid 
predictor test will be of little use if the number of applicants for a job is 
hardly more than the number of vacancies. Even with a low (favorable)
selection ratio, a test or test battery will serve little purpose if the success
ratio is very high. That is, if the job is so simple that almost anyone chosen
at random could perform satisfactorily, there is little need for
psychological tests.

The personnel psychologist usually must perform more research and
usually needs more training in measurement research methods than persons
who work in vocational guidance, clinics, and school programs. In
personnel selection it is often necessary to construct and validate tests for
the particular employment setting. Jobs that have very similar titles may
differ in their actual duties, and consequently, specialized selection
instruments are often required. This means that the personnel psychologist
must either try out a range of commercially distributed tests or construct
tests for his particular purpose. Even if commercially distributed tests are
used, it is often necessary to invent special procedures of administration
and to gather norms in the local setting.

Successful performance in a particular job is only one of the considerations
in selecting personnel. Equally important is whether a person is likely to
rise to higher positions in the institution. A test which is only
moderately predictive of success in a low-level job may be valuable as
a predictor of long-range achievement. Table 18-1 shows the relationship
between intelligence test scores and job level achieved by clerical
workers. Whereas 87 per cent of the persons with the lowest scores on the
TABLE 18-1. RELATIONSHIP BETWEEN INTELLIGENCE TEST SCORES OF CLERICAL
WORKERS AT TIME THEY WERE HIRED AND JOB LEVEL THEY ACHIEVED LATER

(Adapted from Pond and Bills, 9)

Intelligence test     Low job,    Middle job,   High job,
score                 per cent    per cent      per cent
180 and above         4           42            54
160-179               9           71            20
140-159               33          59            8
139 and below         87          13            0


test remained in a low-level job, only 4 per cent of the persons with 
highest test scores failed to move up to higher positions. 

One of the major variables to consider in selecting personnel for some 
jobs is labor turnover. That is, the job may be easy to learn and most per- 
sons may perform well enough, but many employees leave the job after a 
short time. Because it is expensive to “break in” new employees and 
because a high turnover disrupts operations, it is important to select 
persons who are likely to remain on the job for a relatively long time. 
Tests can be used to predict the likelihood that an individual will remain 
on a job. Table 18-2 shows the relationship between intelligence test 


TABLE 18-2. RELATION OF INTELLIGENCE TEST SCORES TO LENGTH OF SERVICE
FOR A GROUP OF CASHIERS AND INSPECTOR-WRAPPERS

(Adapted from Viteles, 10)

                      Average length of
Test score            service, days
90 and above          35
80-89                 87
70-79                 96
60-69                 100
50-59                 107
40-49                 142
30-39                 91
20-29                 91
10-19                 8


scores and length of service in one job setting. In that instance, people in 
the middle intelligence range are more likely to remain on the job for a 
long period. It is likely that interest, attitude, and personality tests will 
also be useful in predicting job turnover. 
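
The pattern in Table 18-2 is not a straight-line one: length of service
rises toward the middle score bands and falls off at both extremes. The
short sketch below enters the figures as they are printed in the table
and picks out the band with the longest average service. The variable
names are illustrative only.

```python
# Average length of service in days by test score band, as printed in Table 18-2.
service_days = {
    "90 and above": 35,
    "80-89": 87,
    "70-79": 96,
    "60-69": 100,
    "50-59": 107,
    "40-49": 142,
    "30-39": 91,
    "20-29": 91,
    "10-19": 8,
}

# Band whose members stayed longest on the job, on the average.
longest_band = max(service_days, key=service_days.get)
print(longest_band, service_days[longest_band])

# Both extremes fall well below the peak, so a cutoff that simply selects
# the highest scorers would not select the longest-serving employees.
extremes = (service_days["90 and above"], service_days["10-19"])
```

A relationship of this shape calls for selecting within a score range
rather than above a single minimum score.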

In selecting individuals for some jobs, the major concern is how safely 
the individual will perform. This is particularly so when the individual 
will be responsible for the lives of other persons, as is the case with bus
drivers, train engineers, and airplane pilots. Tests can be used to predict
the tendency to have accidents on a particular job. Table 18-3 shows the


TABLE 18-3. THE VALIDITY OF VARIOUS TESTS IN THE PREDICTION OF ACCIDENTS
OF TAXICAB DRIVERS

(Adapted from Ghiselli and Brown, 6, p. 349)

Type of test                  Validity coefficient
Dotting                        .35
Tapping                        .47
Judgment of distance           .18
Distance discrimination        .20
Mechanical principles          H
Arithmetic                    -.09
Speed of reaction             -.04
Interest                       .28


validity of various tests as predictors of accidents by taxicab drivers. In 
that instance the two motor skills of dotting and tapping (both con- 
cerned with the rapid and precise use of finger and arm muscles) are the 
best predictors. The most predictive tests tend to be different for different 
jobs, and in only a few cases have tests been successful in predicting who 
will have accidents. 


SOURCES OF TEST INFORMATION 


Although numerous tests have been used in previous chapters to illus- 
trate principles of human measurement, this book is not intended to be a 
complete reference source for available tests. Many tests are developed 
each year, and it is nearly impossible to be well informed about even a
sizable percentage of the new instruments. Therefore, the person who
works with psychological tests needs to be familiar with lists of
available tests, digests of research findings, and reports on
methods and procedures.

Available Tests. A list of the most prominent test publishers is
presented in Appendix 1. It is the usual practice of each company to
distribute a yearly catalogue of available tests. The person who works in
a testing program will want to request test catalogues from most of the
major commercial firms. The catalogues give prices and short descriptions
of the tests. Most companies will send a sample copy of any printed test.
Prices vary, starting at less than a dollar, depending on the test. Test
catalogues are very helpful for keeping informed about new tests. It is
too much to expect that the test publisher will be completely unbiased in
the claims for a new instrument. Critical evaluations are best obtained
from some of the research reports which will be discussed in the
following sections.

In addition to the test catalogues, a number of test bibliographies are 
available. Hildreth (7, 8) published a list of 5,294 tests, covering the 
major instruments that had been developed up to 1945. The tests are 
listed by topic, in order that a person who is interested in, say, manual 
dexterity tests can most easily find the instruments that are available. 
No effort is made to present research findings in respect to the tests or to 
evaluate them in any way. References as to where the test is published 
or where it is described in some detail are given for each test. Other lists 
have been prepared for special testing areas. 

Administration and Use of Tests. After one or several tests are selected 
as likely candidates for use in a particular setting, it is necessary to learn 
how the test is administered and scored. This information is usually sup- 
plied by test manuals, which can be purchased along with or separately 
from particular tests. Manuals usually describe how the test was con- 
structed, provide some norms, and in many cases report research findings 
on reliability and validity. The amount of research evidence given in the 
manual and the nature of the findings will often provide helpful sugges- 
tions as to whether or not a test should be tried in a particular situation. 

A particularly useful publication for all those who deal with psycho- 
logical tests is “Technical Recommendations for Psychological Tests and 
Diagnostic Techniques” prepared by the Committee on Test Standards of 
the American Psychological Association.* It provides numerous sugges-
tions for the development, validation, commercial distribution, and use of
psychological tests.

Research Evidence. It is important before accepting a test to examine 
the research evidence which has accrued from applications to different 
measurement problems. Although a test may look quite promising when 
it is published, it takes several years or more to judge how well it works 
in practice. One source of research findings is the Journal of Applied Psy- 
chology. You would search there if you were interested, for example, in 
the ability of the Strong Interest Inventory to select salesmen. The Jour- 
nal of Consulting Psychology has reviews of new tests and research 
reports on older tests, primarily those instruments which are used most 
widely in clinical psychology. Summaries of research in whole areas of 
testing, such as the application of personality inventories in military psy- 
chology or the use of tests to measure brain damage, appear occasionally 
in the Psychological Bulletin.

* The publication appeared as a supplement to the Psychological Bulletin, 1954, 51,
Part 2. A copy can be obtained by mail for $1 from the American Psychological Asso-
ciation, 1333 Sixteenth Street, N.W., Washington 6, D.C.

The most detailed sources of research information about tests are the
Mental Measurements Yearbooks prepared by Buros (2, 3, 4, 5). The four
Yearbooks currently available were published in 1938, 1941, 1949, and 
1953 respectively. (The gap between 1941 and 1949 was due to the 
Second World War). Each Yearbook is a large volume containing research 
findings and critical reviews of many different tests. The most prominent 
tests receive reviews by several experts. In addition, the Yearbooks supply 
considerable factual information about most tests: administration time, 
cost, scoring procedures, and populations for which the instruments are 
most suitable. Numerous references are provided to other publications 
that report research evidence on particular tests. The Yearbooks are a 
“must” for any person who deals extensively with psychological tests.

Research Methodology. Equally important to keeping up with new
tests as they are developed is staying abreast of new logical and statistical 
developments in psychological measurement. Information concerning 
many of these developments can be obtained from current textbooks on 
psychological measurement. Some of the professional journals also report 
recent advances in methodology. Educational and Psychological Measure- 
ment is primarily concerned with new conceptions in human measure- 
ment, new outlooks on test reliability and validity, new methods of test 
construction, and practical problems of using tests. Psychometrika is 
devoted largely to measurement logic and mathematical statistics, much
of which is pertinent to the development and use of psychological tests. 


PRACTICAL ARRANGEMENTS 


When tests have finally been selected for use in a measurement program,
there are numerous practical considerations which will affect the
efficiency and economy of their use. Many of the practical problems are
specific to particular testing situations; however, there are some rules
and procedures that will generally help in maintaining a smoothly
operating measurement program.

Physical Setting. A prime consideration is to obtain an adequate testing
room. It should be well lighted, free from noise and distractions, and
equipped with the necessary number of desks or tables. It is best to have
a particular room permanently assigned for testing so that it can be
equipped for the testing program. Special equipment, such as wiring for
electrical devices, cabinets for storing test materials, projectors,
public address systems, and others, may be required.

Administration and Scoring. In most cases the examiner should explain to
examinees why particular tests are being administered and how they
should be completed. The instructions should be sufficiently detailed to
cover most of the problems that will come up. For example, if a special
kind of pencil is used with the test, the subjects should be told that
more pencils are available if a point breaks. If the examinees are allowed
to ask questions during the test, they should be told the kinds of ques- 
tions that will be permitted and also how to summon the examiner. When 
a large number of persons are being tested at one time, it is usually better 
to inform them that it will not be possible to answer questions once the 
testing is under way. Subjects must be told what to do when they com- 
plete the test or tests: go back and work on the problems some more, leave 
the room, or remain seated without looking back at the materials. 

It is usually necessary to employ several assistants when testing a large 
group of persons. They can answer questions if questions are permitted. 
They can help direct people to their proper work spaces, give out mate- 
rials, and collect them when the testing is over. Assistants are also helpful 
in preventing cheating. They need have little knowledge about the nature 
and purpose of the tests being used. 

It is very helpful to “role play” the administration procedure before test- 
ing groups of subjects. This consists of going through all of the adminis- 
tration routine with a few people acting as subjects. They are asked to 
note any point at which the testing procedure is unclear. One subject may 
notice that at the bottom of the first page there is no information to indi- 
cate whether to go on to the next page or stop there. Another subject may 
not know whether or not he is to guess when unsure of the answers. A 
trial run of this kind will offer many suggestions for clarifying the admin- 
istration routine. 

Analysis of Test Data. High-speed computational machines are rapidly 
replacing the relatively slow human hand and human mind. Special 
answer sheets are now widely used for testing large groups of subjects. 
The answer sheets can either be scored with a stencil or, much faster, by
machine (International Business Machines Corporation is one of the
best-known distributors of such equipment). Hundreds of answer sheets can
be scored in an hour's time.

It facilitates the subsequent analysis of test results if scores are recorded 
on punch cards. One card can store a considerable amount of information 
about a person: scores on a number of tests; assessment scores, such as 
job ratings and production records; and personal data, such as years of 
experience and marital status. High-speed computational equipment can
then be used to find particular kinds of persons, such as those who score
above a certain level on a test, those who have three years of experience,
and those who are unmarried. Even more important, machine computers can
make relatively fast work of the complex statistics
which are often required in the construction and use of tests. A few
years ago it would have been prohibitively laborious to undertake a 
factor analysis of 100 test items. Now it requires only several hours of 
digital computer time to complete the job. 


THE USE OF HUMAN SUBJECTS


The life sciences (including the biological and social sciences) and 
physical science have many points in common. Both search for lawfulness 
in nature, value objectivity above opinion and desire, and employ many 
of the same conceptual and mathematical tools. They differ mainly in that 
the life sciences deal with people whereas physical science is mostly 
concerned with inanimate objects or subhuman animals. Thus, the work- 
ing routine of the psychologist is vastly more difficult than that of the 
physical scientist for two reasons. First, human behavior obeys no simple 
laws, and consequently, progress in psychology is often slow. Second, the 
psychologist must consider the feelings, needs, and safety of his human 
subjects. The physicist who is interested in the structure of crystals can 
pound them, burn them, freeze them, and work without considering their 
“feelings.” The psychologist must be very careful not to harm or, in many 
cases, even discomfort his human subjects. This section will discuss some 
of the ground rules for dealing with human subjects in psychological 
experiments and testing programs.

The Use of Disturbing Materials. It is often necessary to frighten, 
embarrass, or disturb people in order to carry out an experiment or test 
a particular attribute. For example, if you want to learn the effect of 
anxiety on memory, it is necessary to create human anxiety to perform 
the experiment. This might be done by an electric shock, a series of 
embarrassing questions, or a film of a surgical operation. In some studies 
it is necessary to hypnotize a subject, isolate an individual in a dark room 
for several hours, or have someone act out the part of a murderer. The 
first rule in dealing with material that is likely to create strong disturbance 
is that all subjects should be informed of the nature of the materials before 
the experiment. They need not necessarily be told what the experimenter
hopes to find or all of the conditions of the study, but they should at least
be told about the source of disturbance which will be employed. Secondly,
the subjects should all volunteer without undue pressure from
the examiner, and they should be allowed to discontinue the experiment
if they so desire.

Obtaining Subjects. One of the primary difficulties in performing
experiments or in developing tests is to obtain enough subjects. Whereas
the physical scientist can often obtain large quantities of the material
he needs, and procurement is only a financial problem, it is often
difficult to contact human subjects and persuade them to participate.
They are often not interested in the study, or, like most people, they
are busy with jobs, home life, and recreational pursuits. Consequently,
many studies are performed on college students, simply because they are
the most available subjects. There is a growing awareness that many research
findings are different when the subjects are recruited from the commu- 
nity at large rather than from students only. 

It is generally bad practice to rely on volunteers for developing tests 
and for research studies, although, as was said in the previous section, 
volunteers should be used when disturbing materials are part of the study. 
The people who volunteer are usually more knowledgeable about the 
research area, women tend to volunteer more often than men, and it is 
likely that the personalities of volunteers and nonvolunteers are somewhat 
different. In some cases it is possible and not unethical to assign whole 
groups of individuals to an experiment. This can be done in some school, 
industrial, and military situations. When subjects are not obligated to 
participate in a study, as would be the case in studying the attitudes of a 
particular church congregation, the experimenter should try to enlist as 
many of the people as possible. There will always be some people who 
either cannot or will not participate, but if as many as 80 or 90 per cent 
are obtained, the results will probably not be strongly influenced by 
“volunteer bias.” 

Giving Information. People are sufficiently curious to want to know 
what a research study is about; and when psychological tests are used, 
they will want to know how well they perform. It must be decided prior 
to a study just how much information can be given to subjects. The sub- 
jects should be informed before the experiment what they can be told 
later about the results. 

One place in which decisions must be reached about providing informa- 
tion to subjects is in the routine administration of a testing program. 
There are certain testing situations in which the subject is usually in- 
formed in some detail about the test results, as would be the case in voca- 
tional guidance. At the other extreme, the subject is usually not told the 
results of personality tests given in a clinical setting. The information 
might disturb him and raise more questions than are answered. In 
between these extremes there are situations in which it is difficult to 
decide how much information should be supplied to the subject and to 
other people. 

Subjects generally should not be told the results of tryout instruments 
whose worth has not yet been established. The experimenter may think 
that he is employing a measure of neuroticism, but subsequent research 
studies show that it does not measure what is intended. It would be a real 
injustice to inform a subject that he has a high neurotic tendency on the 
basis of the test. In general, the less known about the validity of the 
instrument, the fewer are the grounds for informing subjects of test 
results.

Disguising Materials. Participants in psychological studies are often
wary that some trick is involved, that the real purpose of the study is dis- 
guised. Many research studies and many tests would not be successful 
without either omitting information about the experimental purpose or
purposefully misinforming subjects. For example, in a study of visual 
illusions, the subject is not told that the purpose is to study distortions in 
perception. If subjects are informed of the illusion, they will not report 
what they see but what they think they should see. It is often necessary 
to disguise the purpose of attitude and personality measures. A good 
example is the situational test described in the previous chapter. The 
subject is told that two men will help solve the problems, whereas their 
real purpose is to hinder and distract. 

The border line between harmless disguise and harmful deception is 
sometimes difficult to define. Surely there should be no more deception 
than is necessary for the experiment. If the deception is of a kind that may 
cause lingering worry for the subjects, they should be informed later of 
the deception and told why the deception was necessary. It is much 
easier to defend disguise in the experimental material than deception 
regarding the use of measurement results. For example, most psycholo- 
gists would consider it highly unethical to tell subjects that test results 
will be kept anonymous and then disclose them later, or to tell subjects 
that test results will be used for research purposes only and later use 
them for personnel selection. 

Human Reaction to Human Measurement. Finally, we should return to 
a point mentioned in the first pages of this book, that psychological
measurement is often reacted to as an unfriendly intrusion into human lives.
There is a wide range of reaction from people who take psychological
tests or participate in research studies; but it is often easy to detect an
underlying tension in even the outwardly most casual subject. Most of us
cherish the belief that we are somehow better than our reputations and
gifted beyond our positions in life. Psychological tests often pose an
ego challenge, a type of objective evidence which may reveal inabilities
and personal deficiencies. Although we can rationalize our everyday
failings, it is hard to argue with test scores.

A typical question asked of psychologists concerns what a score obtained
on a manual dexterity test in high school means. Instances of this kind
show that people are sensitive to their performance on psychological
tests, and test scores may influence people's beliefs about their own
abilities and personalities. Another place where it is easy to see the
popular anxiety over human measurement is in the classroom examination
situation. Although some students may be calm and unconcerned, the
majority show obvious signs of tension, such as frowning, nail biting,
gritting teeth, and other forms of “nervousness.” The tension is even
higher when the graded examinations are returned.

The person who is new in the field of human measurement may be sur- 
prised at the amount of hostility sometimes shown by subjects after a 
research study or after taking tests. For example, after administering a 
tryout personality test, a number of the subjects are likely to come up 
afterwards and say that the test is no good (they hope) or that the study 
will show nothing. This is often a way of blowing off steam because of the
tension brought on by the test, and it is part of the experimenter’s job to
accept the hostility and even encourage its free expression. A special
follow-up meeting, called a “cathartic session,” is often used to allow
people to vent their feelings about a study. This allows a give-and-take
discussion between experimenter and subjects, lets the subjects feel
that they are more than “guinea pigs,” and usually reduces tension
considerably.

The importance of human measurement far outweighs the tension and 
embarrassment that is sometimes generated. We may all be created equal 
in the social sense but we are not all equal in terms of abilities and per- 
sonality characteristics. It would be a great social waste not to help the 
talented achieve as much as they can, and a great social harm to encour- 
age unsuited individuals to labor in ventures that will bring failure and 
discouragement. Taking a frank look at our own abilities and personality 
characteristics is not always pleasant at the moment, but it is necessary 
for long-range achievement and happiness. 
Acorn Publishing Company, Rockville Centre, N.Y.

Association Press, 291 Broadway, New York 7, N.Y.

Bureau of Educational Measurements, Kansas State Teachers College of
Emporia, Emporia, Kans.

Bureau of Educational Research and Service, Department of Publications, State
University of Iowa, Iowa City, Iowa.

Bureau of Publications, Teachers College, Columbia University, New York.

California Test Bureau, 5916 Hollywood Boulevard, Los Angeles 28, Calif.

Educational Test Bureau, 720 Washington Avenue, S.E., Minneapolis 14,
Minn.

Educational Testing Service, 20 Nassau Street, Princeton, N.J.

C. A. Gregory, 345 Calhoun Street, Cincinnati, Ohio.

Grune and Stratton, Inc., 381 Fourth Avenue, New York 16, N.Y.

Harvard University Press, 79 Garden Street, Cambridge, Mass.

Houghton Mifflin Company, 2 Park Street, Boston 7, Mass.

Doncaster G. Humm Personnel Service, P.O. Box 1433, Del Valle Station,
Los Angeles 15, Calif.

Marietta Apparatus Company, Marietta, Ohio.

University of Minnesota Press, Minneapolis 14, Minn.

National Institute for Industrial Psychology, Aldwych House, London, W.C. 2,
England.

The Psychological Corporation, 522 Fifth Avenue, New York 36, N.Y.

Public School Publishing Company, 509 North East Street, Bloomington, Ill.

Science Research Associates, 57 West Grand Avenue, Chicago 10, Ill.

Sheridan Supply Company, P.O. Box 837, Beverly Hills, Calif.

Stanford University Press, Stanford University, Calif.

C. H. Stoelting Company, 424 North Homan Avenue, Chicago 24, Ill.

Teachers College, Columbia University, New York.

Western Psychological Services, Box 799, Beverly Hills, Calif.

Williams & Wilkins Company, Mt. Royal and Guilford Avenues, Baltimore, Md.

World Book Company, 313 Park Hill Avenue, Yonkers, N.Y.


Scale Transformations


A problem that is often encountered in using psychological tests is that of
transforming an obtained set of raw scores to a set with a particular mean and
standard deviation. For example, it might be found that the mean of the
obtained raw test scores is 40 and that the standard deviation is 5. In order to
compare scores on the test with scores on another test, or in order to place the
scores in an easily interpretable form, it might be desired to transform the raw
scores in such a way that the new scores have a mean of 50 and a standard
deviation of 10. Transformations of this kind can be performed with the
following formula:

$$X_t = \frac{\sigma_t}{\sigma_o} X_o - \left[ \frac{\sigma_t}{\sigma_o} M_o - M_t \right]$$

where $X_t$ = scores on the transformed scale
      $X_o$ = scores on the obtained scale: raw scores
      $M_o$, $M_t$ = means of $X_o$ and $X_t$, respectively
      $\sigma_o$, $\sigma_t$ = standard deviations of $X_o$ and $X_t$, respectively

The formula can be applied to the problem illustrated above as follows:

$$X_t = \frac{10}{5} X_o - \left[ \frac{10}{5}(40) - 50 \right] = 2X_o - 30$$

By this transformation a raw score of 40 would be transformed to a score of 50,
and a raw score of 25 would be transformed to a score of 20. Because the
formula is a linear transformation, it does not change the shape of the score
distribution.
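
The transformation can be sketched in a few lines of Python. This is only an illustration of the formula above; the function and argument names are assumptions introduced here, not part of the original text.

```python
def make_linear_transform(mean_obtained, sd_obtained, mean_target, sd_target):
    """Return a function that maps raw scores onto a scale with the
    target mean and standard deviation (a linear transformation)."""
    slope = sd_target / sd_obtained
    # Equivalent to -[(sd_target / sd_obtained) * mean_obtained - mean_target].
    intercept = mean_target - slope * mean_obtained

    def transform(raw_score):
        return slope * raw_score + intercept

    return transform


# Worked example from the text: raw mean 40, raw SD 5; new mean 50, new SD 10.
to_new_scale = make_linear_transform(mean_obtained=40, sd_obtained=5,
                                     mean_target=50, sd_target=10)
print(to_new_scale(40))  # 50.0
print(to_new_scale(25))  # 20.0
```

Because only a slope and an intercept are involved, the rank order of the scores and the shape of their distribution are left unchanged.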


APPENDIX 3


Proportions of the Area in Various Sections
of the Normal Distribution *


Standard score (x/σ),     Area            Area
     + and −             between         beyond
       (1)                 (2)             (3)

       0.00              .0000          1.0000
       0.05              .0398           .9602
       0.10              .0796           .9204
       0.15              .1192           .8808
       0.20              .1586           .8414
       0.25              .1974           .8026
       0.30              .2358           .7642
       0.35              .2736           .7264
       0.40              .3108           .6892
       0.45              .3472           .6528
       0.50              .3830           .6170
       0.55              .4176           .5824
       0.60              .4514           .5486
       0.65              .4844           .5156
       0.70              .5160           .4840
       0.75              .5468           .4532
       0.80              .5762           .4238
       0.85              .6046           .3954
       0.90              .6318           .3682
       0.95              .6578           .3422
       1.00              .6826           .3174
       1.05              .7062           .2938
       1.10              .7286           .2714
       1.15              .7498           .2502
       1.20              .7698           .2302
       1.25              .7888           .2112
       1.30              .8064           .1936
       1.35              .8230           .1770
       1.40              .8384           .1616
       1.45              .8530           .1470
       1.50              .8664           .1336
       1.55              .8788           .1212
       1.60              .8904           .1096
       1.65              .9010           .0990
       1.70              .9108           .0892
       1.75              .9198           .0802
       1.80              .9282           .0718
       1.85              .9356           .0644
       1.90              .9426           .0574
       1.95              .9488           .0512
       2.00              .9544           .0456
       2.05              .9596           .0404
       2.10              .9642           .0358
       2.15              .9684           .0316
       2.20              .9722           .0278
       2.25              .9756           .0244
       2.30              .9786           .0214
       2.35              .9812           .0188
       2.40              .9836           .0164
       2.45              .9858           .0142
       2.50              .9876           .0124
       2.55              .9892           .0108
       2.60              .9906           .0094
       2.65              .9920           .0080
       2.70              .9930           .0070
       2.80              .9948           .0052
       2.90              .9962           .0038
       3.00              .9973           .0027
       3.10              .99806          .00194
       3.20              .99862          .00138
       3.40              .99932          .00068
       3.60              .99968          .00032
       3.80              .999856         .000144
       4.00              .9999366        .0000634
       4.50              .9999932        .0000068
       5.00              .99999942       .00000058
       6.00              .999999998      .000000002


* If, for example, in a normal distribution of test scores, you want to estimate the
number of persons who make scores between plus one standard deviation of the mean
and minus one standard deviation of the mean, you would look opposite 1.00 in the
first column at the proportion in the second column. There it is seen that the
proportion is .6826 or, in other words, approximately 68 per cent. This means that
approximately 32 per cent of the individuals make scores either greater than one
standard deviation above the mean or less than one standard deviation below the
mean. If you want to determine the proportions of people who lie within or beyond
certain standard score units above the mean only or below the mean only, the
proportions in columns 2 and 3 should be halved.
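
For standard scores not listed in the table, the two proportions can be computed directly. The sketch below uses the error function from Python's standard library; the function names are assumptions made for illustration. Values computed this way may differ from the printed table by a unit in the last decimal place.

```python
import math


def area_between(z):
    """Proportion of a normal distribution lying between -z and +z (column 2)."""
    return math.erf(z / math.sqrt(2.0))


def area_beyond(z):
    """Proportion lying below -z or above +z (column 3)."""
    return math.erfc(z / math.sqrt(2.0))


for z in (0.50, 1.00, 2.00, 3.00):
    print(f"{z:.2f}  {area_between(z):.4f}  {area_beyond(z):.4f}")

# One tail only (above +z or below -z): halve the column 3 value.
one_tail_beyond_1 = area_beyond(1.00) / 2.0
```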


APPENDIX 4 


Derivation of Measurement Error Formulas 


Currently there are two mathematical models from which measurement error 
statistics can be derived. Although the same statistics are developed from both
models, the models make different assumptions about the data. The first model
to be examined considers any particular test as one from a universe of possible
tests. The following assumptions are made in the model:

1. All of the sample tests in the universe of tests have the same variance. 

2. All of the sample tests correlate the same with one another. That is, the 
correlation of test 1 with test 2 is the same as the correlation of test 1 with test 3, 
and the same as the correlation of test 2 with test 3, and so on.

3. All of the tests correlate the same with some outside variable (a criterion 
to be predicted, a test from another universe of content, or any other variable). 

If the above assumptions hold, the correlation of the sum, or average, of all 
the sample tests in the universe (which is usually spoken of as the set of “true”
scores) with the criterion variable is obtained as follows:


$$r_{y(\Sigma z)} = \frac{\Sigma z_y (z_1 + z_2 + z_3 + \cdots + z_n)}{\sqrt{\Sigma z_y^2}\,\sqrt{\Sigma (z_1 + z_2 + z_3 + \cdots + z_n)^2}}$$

where $r_{y(\Sigma z)}$ = the correlation of the criterion variable, y, with the sum of the
      standard scores on all sample tests in the universe
      $z_1$ = standard scores on sample test 1
      $z_n$ = standard scores on test n

The equation above can be expanded as follows:

$$r_{y(\Sigma z)} = \frac{r_{y1} + r_{y2} + r_{y3} + \cdots + r_{yn}}{\sqrt{r_{11} + r_{12} + \cdots + r_{1n} + r_{21} + \cdots + r_{nn}}}$$

where the denominator is the square root of the sum of all the coefficients in
the square matrix of intercorrelations of sample tests.

Because of assumptions 2 and 3 in the model:

$$r_{y(\Sigma z)} = \frac{n\,r_{yi}}{\sqrt{n + n(n-1)\,r_{ij}}} = \frac{r_{yi}}{\sqrt{\dfrac{1}{n} + \dfrac{n-1}{n}\,r_{ij}}}$$

$$\lim_{n \to \infty} r_{y(\Sigma z)} = \frac{r_{yi}}{\sqrt{r_{ij}}}, \qquad i \neq j$$

What the formula says is that, if the assumptions in the model hold, the corre- 
lation between a criterion and a set of “true” scores is obtained by dividing the 


correlation between a particular test and the criterion by the square root of 
the correlation between the particular test and any other test in the universe. 
The reader will recognize this as the correction for attenuation in one variable. 
The correlation of any particular test with the sum of all the sample tests in 
the universe can be obtained as follows: 
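
$$r_{i(\Sigma z)} = \frac{\Sigma z_i (z_1 + z_2 + z_3 + \cdots + z_n)}{\sqrt{\Sigma z_i^2}\,\sqrt{\Sigma (z_1 + z_2 + z_3 + \cdots + z_n)^2}}$$

and because of the assumptions made in the model:

$$r_{i(\Sigma z)} = \frac{1 + (n-1)\,r_{ij}}{\sqrt{n + n(n-1)\,r_{ij}}} = \frac{\dfrac{1}{n} + \dfrac{n-1}{n}\,r_{ij}}{\sqrt{\dfrac{1}{n} + \dfrac{n-1}{n}\,r_{ij}}}, \qquad i \neq j$$

$$\lim_{n \to \infty} r_{i(\Sigma z)} = \frac{r_{ij}}{\sqrt{r_{ij}}} = \sqrt{r_{ij}}$$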


What the formula says is that, if the assumptions in the model hold, the
correlation of a set of test scores with the sum of all of the scores from the
tests in the universe equals the square root of the correlation between any two
tests in the universe. The correlation between any two sample tests is the
reliability coefficient, from which the other measurement error statistics are
obtained.

Of course the conditions of the model will not hold exactly in practice.
Sample tests (equivalent forms) will correlate differently with any outside
variable, and their intercorrelations will not all be equal. Consequently, the
reliability coefficient is only estimated by the correlation of two particular
sample tests, or equivalent forms. The advantage of the model is that it shows
what assumptions must be met in determining measurement error statistics, and
it provides an indication of the way in which departures from the model can be
studied.
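
The two limiting results of the first model can be checked numerically. The sketch below evaluates the finite-n expressions under the model's assumptions (equal intercorrelations among tests, equal correlations with the criterion); the function names and the values .40 and .64 are illustrative assumptions, not figures from the text.

```python
import math


def corr_criterion_with_sum(r_yi, r_ij, n):
    """Correlation of a criterion with the sum of n sample tests, when every
    test correlates r_yi with the criterion and r_ij with every other test."""
    return n * r_yi / math.sqrt(n + n * (n - 1) * r_ij)


def corr_test_with_sum(r_ij, n):
    """Correlation of one sample test with the sum of all n sample tests."""
    return (1 + (n - 1) * r_ij) / math.sqrt(n + n * (n - 1) * r_ij)


r_yi, r_ij = 0.40, 0.64  # hypothetical values chosen for illustration
for n in (1, 10, 100, 10_000):
    print(n,
          round(corr_criterion_with_sum(r_yi, r_ij, n), 4),
          round(corr_test_with_sum(r_ij, n), 4))

# Limits as n grows without bound:
limit_criterion = r_yi / math.sqrt(r_ij)  # correction for attenuation
limit_test = math.sqrt(r_ij)              # square root of the intercorrelation
print(limit_criterion, limit_test)
```

With as few as 100 sample tests the finite-n values are already close to the limits, which is in keeping with the remark that the model should hold with good accuracy for a large but finite number of tests.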


There are several disadvantages of the above model for determining measurement
error statistics. One difficulty is that the model is based on a sampling
theory, which considers any particular test as one from a universe of possible
tests. It is difficult to define universes of test content and difficult to ensure that
all tests are from the same universe of content. Another disadvantage of the
model is that it requires an infinite number of sample tests, and in consequence,
the infinity sign appears in the derivation of measurement error statistics. Infinity
is, of course, only a handy fiction when discussing empirical data. However, the
model should hold with good accuracy if a large number of sample tests, say
100, is studied. Although studies of this kind have not yet been undertaken, it
would be possible to develop 100 similar tests, such as 100 short spelling,
vocabulary, or arithmetic tests, and study the extent to which the model fits
empirical data.

The second model which is used for the derivation of measurement error
statistics begins by defining an obtained score as the sum of a “true,” or
universe, score and an error score:

$$x = x_t + e$$

where $x$ is the score obtained by an individual on a particular test
      $x_t$ is the individual’s true score
      $e$ is the error or chance portion of the obtained score

The first assumption in the model is that the error scores are uncorrelated with
true scores. Therefore

$$\sigma_x^2 = \sigma_t^2 + \sigma_e^2$$

If there are two measures of the same trait, the true score is expected to be the
same in both measurements, but the error scores are not necessarily the same.
The two obtained scores can then be broken down as follows:

$$x_1 = x_t + e_1$$
$$x_2 = x_t + e_2$$

The two measures are usually spoken of as equivalent forms, or sometimes they
are called parallel or comparable forms.

The reliability coefficient is defined as the correlation between two
equivalent forms:

$$r_{12} = \frac{\Sigma x_1 x_2}{N \sigma_1 \sigma_2}
        = \frac{\Sigma (x_t + e_1)(x_t + e_2)}{N \sigma_1 \sigma_2}
        = \frac{\Sigma x_t^2 + \Sigma x_t e_2 + \Sigma x_t e_1 + \Sigma e_1 e_2}{N \sigma_1 \sigma_2}$$


Several assumptions are then made:

1. As in the first model, it is assumed that the two equivalent forms have
equal standard deviations, and, consequently, equal variances.

2. It is assumed that error scores on each test are uncorrelated with true
scores on both tests.

3. It is assumed that error scores on one test are uncorrelated with error scores
on the other test.

If assumptions 2 and 3 hold, three of the terms in the numerator of the above
equation vanish, and the correlation between equivalent forms reduces to the
following:

$$r_{12} = \frac{\sigma_t^2}{\sigma_1 \sigma_2}$$

Because of the assumption that the standard deviations are equal on the two
equivalent forms, the reliability coefficient may be expressed as

$$r_{11} = \frac{\sigma_t^2}{\sigma_x^2}$$

What the formula says is that, if the assumptions in the model hold, the
correlation of two equivalent forms equals the ratio of the “true” variance to the
variance of either measurement or, regarding the ratio as a percentage, equals the
per cent of non-error variance. This is the same conclusion that we obtained
from the first model.
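
The conclusion that the correlation between equivalent forms equals the ratio of true variance to obtained variance can be illustrated with a small simulation. This is a sketch under the model's assumptions (errors independent of true scores and of each other); the standard deviations of 10 and 5, the sample size, and all names are hypothetical choices made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)          # fixed seed so the run is repeatable
sd_true, sd_error = 10.0, 5.0   # hypothetical standard deviations
n_persons = 20_000

true_scores = [rng.gauss(0.0, sd_true) for _ in range(n_persons)]
# Two equivalent forms: same true score, independent errors of equal variance.
form_1 = [t + rng.gauss(0.0, sd_error) for t in true_scores]
form_2 = [t + rng.gauss(0.0, sd_error) for t in true_scores]

expected_reliability = sd_true ** 2 / (sd_true ** 2 + sd_error ** 2)  # 0.8
observed_reliability = pearson_r(form_1, form_2)
print(expected_reliability, round(observed_reliability, 3))
```

The observed correlation will not equal the theoretical ratio exactly, since it is itself subject to sampling error, but with a large number of simulated persons the two should agree closely.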
The standard error of measurement can be obtained through the following
developments:

$$\sigma_x^2 = \sigma_t^2 + \sigma_e^2$$

Both sides of the equation can then be divided by the variance of x:

$$1 = \frac{\sigma_t^2}{\sigma_x^2} + \frac{\sigma_e^2}{\sigma_x^2}$$

estimates are made in general. Consequently, advocates of the second model
have begun to think in terms of the relationships among collections of equivalent 
forms, all of which have the same variance and all of whose intercorrelations 
are equal. Therefore, a sampling theory is creeping into the second model; and, 
as the necessity for considering any test as one from a possible universe is 
brought into the model, the second model presented begins to collapse into the 
first model presented. 

There are other reasons why the first model presented is probably better than 
the second. In particular, some of the constructs in the second model cannot be 
measured directly. Neither “true scores” nor “error scores” can be measured 
directly and can only be inferred from correlations among equivalent forms. 
The first model presented does not require these nonoperational constructs. 
Another mark against the second model is that some of the assumptions are 
probably not true in particular testing situations. On some occasions it is likely 
that errors will be correlated with “true” scores and errors on one test will be 
correlated with errors on another test. The first model does not require either of 
these assumptions.

It is comforting to know that two different, or apparently different, models 
arrive at the same formulas. This gives us some confidence in applying the 
standard measurement error statistics, such as the correction for attenuation and 
others. Future research should indicate which of the two models is more
generally useful in studying the effect of measurement error in psychological tests.


APPENDIX 5 


The Influence of Test Length on Reliability 


If reliability is determined by the split-half method, the correlation between 
the split-halves is the reliability of half the test rather than the reliability of the 
whole test. Of course the whole test is employed in applied work, and the reli- 
ability of the whole test is larger than that of the half-tests. The reliability of the 
whole test can be estimated as follows: 


$$r_{11} = \frac{2 r_{hh}}{1 + r_{hh}}$$

where $r_{11}$ is the reliability of the whole test
      $r_{hh}$ is the correlation between the split-halves

The formula is referred to as the “Spearman-Brown prophecy formula.” The use
of the formula can be illustrated with a correlation between split-halves of .80:

$$r_{11} = \frac{2(.80)}{1 + .80} = \frac{1.60}{1.80} = .89$$


Thus the best estimate of the reliability of the whole test is .89. 


What the Spearman-Brown formula does is to estimate the reliability of a test
twice as long (twice as many of the same kinds of items) as the test from which
the reliability is actually computed. The formula can be extended to estimate
the reliability of a test lengthened by any particular amount:

$$r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}}$$

where $r_{nn}$ = the reliability of a test n times as long as the original test
      $r_{11}$ = the reliability of the original test

If the reliability of a 20-item test is .70, and if 40 items very similar to the first
20 items are added to the test, the reliability of the lengthened test is estimated
as follows:

$$r_{nn} = \frac{3(.70)}{1 + (3 - 1)(.70)} = \frac{2.10}{1 + 1.40} = \frac{2.10}{2.40} = .88$$

Some cautions should be heeded in applying the formulas for estimating the 
reliability of a lengthened test. The formulas assume that when the test is 
lengthened the new items will be of the same kind as the original test items. 
If the new items relate to factors other than those in the original items or if the 
new items are generally easier or more difficult than the original items, the 
formulas will give misleading estimates. However, in spite of the assumptions 
which the formulas make, they provide a very useful, and usually a reasonably 
accurate, estimate of the gain in reliability that will come from lengthening a
test.

The formula can also be used when it is desired to use a
shorter version of a particular test in order to reduce the time for administering
test materials. For example, if the reliability of a 100-item test is known, and it
is desired to use only 50 of the items, say, choosing only the odd-numbered
items, the reliability of the 50-item test can be estimated by substituting ½ for n
in the equation above and substituting the reliability of the 100-item test for $r_{11}$.
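
The general formula covers doubling, lengthening by any factor, and shortening, as the following sketch shows. The function name is an assumption made for illustration, and the reliability of .90 used in the shortening case is a hypothetical value, not one given in the text.

```python
def spearman_brown(reliability, length_factor):
    """Estimated reliability of a test `length_factor` times as long as the
    test whose reliability is given (Spearman-Brown prophecy formula)."""
    n = length_factor
    return n * reliability / (1 + (n - 1) * reliability)


# Split-half correlation of .80 stepped up to the whole test (n = 2).
print(round(spearman_brown(0.80, 2), 2))    # 0.89

# A 20-item test with reliability .70 lengthened to 60 items (n = 3).
print(round(spearman_brown(0.70, 3), 3))    # 0.875, reported as .88 in the text

# Hypothetical 100-item test with reliability .90 cut to 50 items (n = 1/2).
print(round(spearman_brown(0.90, 0.5), 2))  # 0.82
```

The estimates rest on the assumption stated above: the added or removed items must be of the same kind as the items on which the original reliability was computed.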


APPENDIX 6 


Sampling Error of Percentages 


In opinion polling or in any other use of samples of persons, the results 
obtained are only estimates of the true population values. Sampling error was 
illustrated in Chapter 13 by considering the results of polling 1,000 persons, in 
which it is found that 55 per cent of the people say that they will vote for 
candidate A and the remaining 45 per cent say they will vote for candidate B. 
Before it can be safely said that candidate A has an edge over candidate B, it 
must be ensured that the apparent difference in voting sentiment is not likely to 
be due to sampling error only. If more than 50 per cent of the population actu- 
ally vote for A, he will be elected. Consequently, the question is whether the 
sample value of 55 per cent who say they will vote for A is significantly greater 
than 50 per cent. This can be determined by the standard error of the
percentage, which is obtained as follows:

$$\sigma_{\%} = \sqrt{\frac{PQ}{N}}$$

where P = the hypothetical population percentage
      Q = 100 − P
      N = the number of persons in the sample

The formula is applied to the illustrative problem as follows:

$$\sigma_{\%} = \sqrt{\frac{50 \times 50}{1{,}000}} = \sqrt{\frac{2{,}500}{1{,}000}} = \sqrt{2.50} = 1.58$$


What this means is that, if samples of 1,000 persons are drawn at random from
the population, the expected standard deviation of the sample percentages is
1.58 when the population percentage is 50. The difference between the sample
value and the hypothetical population value is then divided by the standard
error:

$$\frac{55 - 50}{1.58} = 3.16$$

The resulting value is a standard score, and its significance can be found in
Appendix 3. There it is seen that a standard score of 3.20 is nearest to the 
obtained value of 3.16. Only .00138 of the cases lie above and below 3.20 stand- 
ard scores. Here we are interested only in the proportion of people in one tail 
of the distribution. Our only concern is whether the population per cent is
greater than 50 per cent. If it is, it makes no difference whether the population 
per cent is actually larger than the sample value of 55 per cent. Therefore, the 
value found in Appendix 3 should be divided by 2, which gives .00069. In 
other words, with a population per cent of 50 and with samples of 1,000 each, 
less than one time in a thousand would we expect to find sample values of 
55 per cent or greater. 
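
The computation for the polling example can be sketched as follows. The function name is an assumption made for illustration, and the one-tail proportion is computed directly from the normal distribution rather than read from the nearest entry in Appendix 3, so it differs slightly from the tabled figure of .00069.

```python
import math


def standard_error_of_percentage(p_percent, n_persons):
    """Standard error of a sample percentage, given the hypothetical
    population percentage P and the sample size N: sqrt(P * Q / N)."""
    q_percent = 100.0 - p_percent
    return math.sqrt(p_percent * q_percent / n_persons)


se = standard_error_of_percentage(50, 1000)
z = (55 - 50) / se                 # sample value minus hypothetical value
print(round(se, 2), round(z, 2))   # 1.58 3.16

# Proportion of samples expected to reach 55 per cent or more (one tail only).
one_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
print(one_tail < 0.001)            # True: less than one time in a thousand
```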

Another problem in which sampling error formulas are needed is in deter- 
mining the significance of the difference between the results obtained from two 
independent samples. For example, the results of one poll might show that 
55 per cent of the people intend to vote for candidate A, and a later poll, using 
a different sample of persons, shows that 60 per cent of the people intend to 
vote for A. It is tempting to interpret the results as showing an increased 
popularity of candidate A. However, before this can be done, it is necessary to test 
the significance of the difference between the two sample results. Formulas for 
the significance of the difference between two percentages and for more com- 
plex sampling error problems can be found in standard statistical texts. 


¹ See A. L. Edwards, Experimental Design in Psychological Research, New York: 
Rinehart, 1950; Q. McNemar, Psychological Statistics, 2d ed., New York: Wiley, 
1955; Helen M. Walker and J. Lev, Statistical Inference, New York: Holt, 1953. 
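
The formula for the difference between two percentages is not given in this appendix. As a hedged illustration only, the sketch below uses one common textbook form, in which the two samples are pooled to estimate the population percentage. The sample sizes of 1,000 for each poll are assumptions made for the example; they are not stated in the text.

```python
import math


def se_difference_of_percentages(p1, n1, p2, n2):
    """Standard error of the difference between two independent sample
    percentages, using a pooled estimate of the population percentage
    (one common textbook form; not taken from this appendix)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    return math.sqrt(pooled * (100.0 - pooled) * (1.0 / n1 + 1.0 / n2))


# Polls of 55 and 60 per cent; both sample sizes are assumed to be 1,000.
se_diff = se_difference_of_percentages(55.0, 1000, 60.0, 1000)
z_diff = (60.0 - 55.0) / se_diff

print(round(se_diff, 2))  # standard error of the difference
print(round(z_diff, 2))   # standard score of the observed difference
```

The resulting standard score would be interpreted with the table of the normal curve in the same way as in the one-sample problem.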


APPENDIX 7 


Computation of Correlations among Q-sorts 


One of the handy features of the Q-sort rating scheme is that it permits the 
computation of correlation coefficients with a minimum of labor. Because the 
same distribution is used for all sorts in a particular study, the following formula 
can be used: 


$$r = 1 - \frac{\sum d^2}{2N\sigma^2}$$


where r = the product-moment correlation between two sorts 
N = the number of items in the sample (100 in the illustrative sample of 
photographs of art objects mentioned in Chapter 17) 
σ = the standard deviation of the forced distribution, which is the same 
for all sorts using the distribution 
d² = the squared difference between the pile number for an item in two 
sorts; thus, if a particular item is placed in pile 7 in one sort and 
pile 9 in another, the d² is 4. The sum of d² is taken over the N 
items 


The term 2Nσ² is a constant for all correlations computed with a particular 
forced distribution. After the constant is determined, it is only necessary to 
obtain the sum of squared differences between item placements to compute the 
correlation. For any particular Q-sort distribution, a table can be composed to 
transform sums of squared differences directly into correlation coefficients. The 
formula effects a great time saving for the individual who does not have high- 
speed computers available. A correlation can be determined by the formula in a 
small fraction of the time it takes to use the customary product-moment 
formulas. 
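
The shortcut can be sketched in Python as follows. The nine-item, five-pile forced distribution and the two sorts are hypothetical and serve only to keep the arithmetic visible; the function names are likewise illustrative. The standard deviation is computed with N, not N - 1, in the denominator, which is what the formula requires.

```python
import math


def forced_distribution_sigma(pile_counts):
    """Standard deviation of a forced distribution, given the number of
    items required in each pile (denominator N, not N - 1)."""
    total = sum(pile_counts.values())
    mean = sum(pile * count for pile, count in pile_counts.items()) / total
    variance = sum(count * (pile - mean) ** 2
                   for pile, count in pile_counts.items()) / total
    return math.sqrt(variance)


def q_sort_correlation(sort_a, sort_b, sigma):
    """r = 1 - (sum of d squared) / (2 * N * sigma squared)."""
    if len(sort_a) != len(sort_b):
        raise ValueError("both sorts must cover the same items")
    n_items = len(sort_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(sort_a, sort_b))
    return 1.0 - sum_d2 / (2.0 * n_items * sigma ** 2)


# Hypothetical forced distribution: piles 1 to 5 holding 1, 2, 3, 2, 1 items.
pile_counts = {1: 1, 2: 2, 3: 3, 4: 2, 5: 1}
sigma = forced_distribution_sigma(pile_counts)

# Pile numbers given to the same nine items in two sorts (both obey the
# forced distribution).
sort_a = [1, 2, 2, 3, 3, 3, 4, 4, 5]
sort_b = [2, 1, 2, 3, 3, 4, 3, 4, 5]
r = q_sort_correlation(sort_a, sort_b, sigma)
print(round(r, 4))

# Table turning sums of squared differences directly into correlations.
constant = 2.0 * len(sort_a) * sigma ** 2
table = {sum_d2: round(1.0 - sum_d2 / constant, 3) for sum_d2 in range(0, 13, 2)}
print(table)
```

Only even sums appear in the table because, when both sorts follow the same forced distribution, the sum of squared differences is always an even number.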
APPENDIX 8 


Some Semantic Differential Scales Which Have 
Been Used in Various Studies 


angular-rounded 
bass-treble 
beautiful-ugly 
black-white 
blatant-muted 
boring-interesting 
brave-cowardly 
bright-dark 
calm-agitated 
calming-exciting 
chaotic-ordered 
clean-dirty 
clear- 
colorful-colorless 
concentrated-diffuse 
controlled-accidental 
deep-shallow 
definite-uncertain 
deliberate-careless 
dull-exciting 
dull-sharp 

$ RRE 
empty-full 
even-uneven 
expensive-cheap 
fair-unfair 
familiar-strange 
fast-slow 
ferocious-peaceful 
formal-informal 
fragrant-foul 
fresh-stale 
gentle-violent 
genuine-artificial 
gliding-scraping 
good-bad 
happy-sad 
hard-soft 
healthy-sick 
heavy-light 
high-class—low-class 
high-low 
honest-dishonest 
hot-cold 
humorous-serious 
important-trivial 
intimate-remote 
kind-cruel 
labored-easy 
large-small 
long-short 
loose-tight 
loud-soft 
lush-austere 
masculine-feminine 
meaningless-meaningful 
mild-intense 
modern-old-fashioned 
near-far 
nice-awful 
obvious-subtle 
pleasant-unpleasant 
pleasing-annoying 
powerful-weak 
pungent-bland 
pushing-pulling 
rumbling-whining 
red-green 
relaxed-tense 
repeated-varied 
repetitive-varied 
resting-busy 
rich-poor 
rich-thin 
rough-smooth 
rugged-delicate 
sacred-profane 
safe-dangerous 
simple-complex 
sincere-insincere 
solid-hollow 
static-dynamic 
steady-fluttering 
strong-weak 
superficial-profound 
sweet-bitter 

APPENDIX 9 


The Influence of Reliability 
on Predictive Validity 


In Chapter 6, we discussed the correction for attenuation, which estimates 
how highly a test would correlate with an assessment if both were made per- 
fectly reliable. A more useful formula is that for estimating the validity of a 
test if the reliabilities of the test and the assessment are raised or lowered by 
particular amounts. The formula is as follows: 


$$r'_{12} = r_{12}\,\frac{\sqrt{r'_{11}}\,\sqrt{r'_{22}}}{\sqrt{r_{11}}\,\sqrt{r_{22}}}$$


where $r_{11}$ and $r_{22}$ = the original reliabilities of the predictor test and the 
assessment respectively 
$r'_{11}$ and $r'_{22}$ = the new reliabilities of the predictor test and assessment 
$r_{12}$ = the original correlation between the test and the assessment 
$r'_{12}$ = the estimated correlation between the test and the assessment 
after their respective reliabilities have been increased 
or decreased 


The formula can be illustrated in a situation in which a test is used to predict 
ratings of performance. Assume that the original reliabilities of the test and the 
assessment are .64 and .49 respectively and that the correlation between the 
test and the assessment is .36. Suppose that, by improving the test and the 
assessment, their reliabilities are raised to .81 and .64 respectively. The 
estimated validity of the test after the reliabilities are increased is obtained as 
follows: 


$$r'_{12} = \frac{.36\,\sqrt{.81}\,\sqrt{.64}}{\sqrt{.64}\,\sqrt{.49}} = .46$$


The formula will also serve to estimate the validity of a test if the reliabilities of 
the test and the assessment are decreased or if one is increased and the other 
decreased. 
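
The same computation can be written as a short Python function. This is a minimal sketch with illustrative names; it assumes all reliabilities are greater than zero. Setting both new reliabilities to 1.00 reduces the formula to the correction for attenuation discussed in Chapter 6.

```python
import math


def adjusted_validity(r12, r11, r22, new_r11, new_r22):
    """Estimated test-assessment correlation after the reliabilities of the
    test (r11) and the assessment (r22) are changed to new_r11 and new_r22."""
    return r12 * math.sqrt(new_r11 * new_r22) / math.sqrt(r11 * r22)


# Values from the illustration: reliabilities raised from .64 and .49
# to .81 and .64, with an original correlation of .36.
estimate = adjusted_validity(0.36, 0.64, 0.49, 0.81, 0.64)
print(round(estimate, 2))

# Both measures made perfectly reliable (the correction for attenuation).
ceiling = adjusted_validity(0.36, 0.64, 0.49, 1.0, 1.0)
print(round(ceiling, 2))
```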
Experiments, measurement in, 373-377 


Face validity, 66 
Factor analysis, 164-174 
communality in, 169 
in test construction, 145-147 
vector model in, 168-172 
Factored attitude scales, 310-311 
Factorial validity, 67 
Faculty psychology, 164 


Faking on self-description inventories, 
335-336 


Farnsworth Dichotomous Test for Color 
Blindness, 238 

Farnsworth Scales, 258 

Farnsworth-Munsell 100 Hue Test, 238 

Flanagan Aptitude Classification Tests 
(FACT), 189 

Forced-choice inventories, 336 

Forced distribution, use with Q-sort, 
377-378 
G factor from Spearman's work, 164-167 

Gates Basic Reading Tests, 276 

General Aptitude Test Battery (GATB), 
189, 253 

General Clerical Test, 253-254 

“General intelligence” tests, 191-235 

General reasoning factor, 176 

Gesell Development Schedules, 218-219 


Gottschaldt Figures Tests, 363 
Grade norms, 53 

“Grading on the curve,” 54 
Graduate Record Examination, 280 
Graphic art aptitudes, 258-262 
Graves Design Judgment Test, 260 


Group tests, for college students, 216- 
218 
compared to individual tests, 192-194, 
213-227 


Guessing, correction for, 54-56 
influence on reliability, 56, 105-106 

Guilford-Zimmerman Aptitude Survey, 
189 

Guilford-Zimmerman Temperament Sur- 
vey, 329-330, 332 

Guttman method of attitude scaling, 308- 
310 


Hearing tests, 238-240 

Henmon-Nelson Tests of Mental Ability, 
215 

High school achievement tests, 278-280 

Holmgren Woolens, 238 

Horn Art Aptitude Inventory, 260-261 


Ideal profiles, 129-130 
Illuminant-Stable Color Vision Test, 238 
“Indifferent zone” in personnel selection, 
94 
Individual tests compared to group tests, 
192-194, 196-213 
Infants, tests for, 218-220 
Information sources for tests, 411-413 
Inheritance of intelligence, 228-229 
Ink blot test, 340-343 
Instructional techniques, measurement of, 
with achievement tests, 273-274 
Instructions for taking tests, 159-160 
Intelligence quotient (IQ), 53 
computation of, 199-200 
use of, with adults, 206 
Intelligence Test for Young Children, 222 
“Intelligence” tests, 191-227 
Interest inventories, 315-319 
Iowa Silent Reading Tests, 281 
Iowa Tests of Educational Development, 
280 
Ishihara Color Plates, 238 
Item analysis, 144-149 
Item-criterion correlations, 148-150 
Item-interrelationship reliability, 109-110 
Item pool, 143 
Item universe, 143 


Job analysis, 141-142 


Knauber Art Ability Test, 261 

Knowledge tests in personality measure- 
ment, 365-366 

Kuder Preference Record, 317-319 

Kuhlmann-Anderson Intelligence Tests, 
214-216 

Kwalwasser-Dykema Music Tests, 257 


Labels in measurement (see Scales of 
measurement) 

Least-squares criterion, 75-76 

Leiter International Performance Scale, 
226 
Length of test, influence on reliability, 
111, 429-430 


Level in profile analysis, 126-127 
McAdory Art Test, 258-259, 262 
Maladjustment, measurement of, 321-327 
Meaning, measurement of, 383-389 
Meaningful memory factor, 177-178 
Measurement error, derivation of for- 
mulas for, 424-428 
Mechanical aptitude, 244-253 
Median, 36 
Meier Art Judgment Test, 259-260 
Memory factors, 177-178 
Memory span factor, 178 
Mental age, 29, 52 
Merrill-Palmer Scale, 222 
Metropolitan Achievement Tests, 277- 
278 
Miller Analogies Test, 218 
Minimal Social Behavior Scale, 363 
Minnesota Clerical Test, 253 
Minnesota Multiphasic Personality In- 
ventory (MMPI), 324-327 
Minnesota Paper Form Board Test (re- 
vised), 247 
Minnesota Preschool Scale, 220-222 
Mode, 36 
Motor dexterity, 241-244 
Müller-Lyer illusion, 18-19 
Multifactor batteries of tests, 181-190 
Multiple-choice examinations, 153-159 
Multiple correlation, 114-123 
computation of, 120 
shrinkage of, 121-122 
standard error of estimate for, 121 
statistical significance, 121-123 
Multiple-cutoff method, 137-139 
Multiple-factor analysis, 167-172 
Multiple regression, 119-121 
Multitest method of item analysis, 144- 
147 
Multivariate prediction, 114-140 
Musical aptitudes, 255-258 


National Intelligence Test, 214-215 

Naval Knowledge Test, 365-366 

Nelson-Denny Reading Test: Vocabulary 
and Paragraph, 276 

Newspapers, content analysis of, 399-400 

Normal distribution, 24-25, 43-46, 421- 
423 

Norms, 48-53 

Northwestern Infant Intelligence Tests, 
220 

Number factors, 176 

Numerical computation factor, 176 
Observation of behavior, 356-363 
Ohio State University Psychological Test, 
Form 21, 218 
Opinion polls, 286-300 
Ordinal scales, 8-12 
(See also Attitude scales) 
Ortho-Rater, 238 
Otis Alpha, 214, 216 
Otis Beta, 215 


Outline of content in test construction, 
152-153 
Pair comparisons, method of, in 
psychophysics, 20-21 
Panels, use of, in opinion polling, 292- 
293 
Percentages, sampling error of, 431-432 
Percentile ranks, 47-48 
Perceptual closure factor, 179-180 
Perceptual factors, 179-180 
Perceptual speed factor, 179 
Performance tests, 222-225 
compared to verbal tests, 194-196 
Persistence tests, 364 
Personal Data Sheet, 323-324 
Personal equation, 15 
Personality factors, 329-330 
Personnel selection, 409-411 
Physique relation to personality, 367 
Pintner-Cunningham Primary Test, 214- 
216 
Pintner-Paterson Performance Scale, 222- 
225 
Play techniques as projective tests, 347- 
348 
Potency factor on Semantic Differential, 
385 
Prediction, of accidents, 410-411 
differential, 124-125 
of elections, 287-289 
multivariate, 114-140 
statistics of, 71-94 
Predictor tests, evaluation of, 58, 62-65 
Preschool children, tests for, 220-222 
Primary Mental Abilities Test (PMA), 
181-183 
Processes, definition of, 5 
Profile analysis, 126-134 
shape in, 127-128 
of Wechsler-Bellevue test, 210-211 
Progressive Matrices Test, 226-227 
Projection, 339 

Projective techniques, 338-353 
Propaganda, content analysis of, 400 
Psychophysics, 15-21 


Q-sort, 377-388 
Quota sample, 291 


Random sample, 290 
Range as measure of dispersion, 38 
Rank-order method in psychophysics, 20 
Rank-order scores, 46-47 
Rating methods in personality measure- 
ment, 357-359 
Ratio scales, 8-12 
(See also Attitude scales) 
Readability, 396-397 
Reading, achievement tests for, 266-267, 
275-276 
Reasoning factors, 176-177 
Regression equations, 85-86 
Reliability, of difference scores, 128-129 
of measurements, 95-113 
effect on, of content sampling, 106 
of guessing, 56, 105-106 
of test length, 111, 429-430 
of testing environment, 105 
equivalent-form, 108 
item-interrelationship, 109-110 
measurement error, 95-98 
reliability coefficient, 99-100 
“true” scores estimated from, 98- 
102 
retest, 108 
sources of unreliability, 104-107 
split-half, 109 
Response sets, 364-365 
Retest reliability, 108 
Rorschach Test, 340-343, 345, 346, 349- 
352 
Rote memory factor, 177 


Sampling, bias in, 49, 287-291 
of content, effect on reliability, 106 
in relation to opinion polls, 289-293 
Scales of measurement, 8-12 
(See also Attitude scales) 
Scatter in profile analysis, 127 
Scatter diagrams, 79-84 
School programs, use of tests in, 408-409 
Seashore Measures of Musical Talents, 
255-257 
Seashore-Hevner Tests for Attitude toward 
Music, 258 
Selection ratio, 92 
Self-conception, measurement of, with 
Q-sort, 381-382 
Self-description inventories, 321-337 
acceptability influence on, 335-336 
faking on, 335-336 
“validity scores” on, 326-327 
Self-report inventories, 321-337 
Semantic Differential, 383-389 
Semantic Test of Intelligence, 226 
Sensory tests, 236-240 
Sentence completion test, 347 
Shape in profile analysis, 127-128 
Shipley Personal Inventory, 336 
Shrinkage of multiple correlations, 121- 
122 
Siblings, correlation of intelligence test 
scores, 228-229 
Sight-Screener, 238 
Similarity judgments, 389-391 
Situational tests, 359-361 
Sixteen Personality Factor Questionnaire, 
330 
Snellen Chart, 237-238 
Social distance in measurement of atti- 
tudes, 306-308 
Social traits, measurement of, 321, 328- 
330 
Sociogram, 391-393 
Sociometry, 391-393 
Spatial factors, 178-179 
Spatial orientation factor, 178 
Spatial visualization factor, 179 
Specimen sets of tests, 411 
Speed, relationship to ability factors, 
180-181 
Split-half reliability, 109 
Spot-check sample, 291-292 
SRA Mechanical Aptitude Test, 251 
Standard deviation, 39-41 
Standard error, of correlations, 88-90 
of estimate for, 79-82 
of measurement, 100-102 
Standard scores, 45-46 
Standardized Oral Reading Paragraphs, 
281 
Stanford Achievement Tests, 278 




Stanford-Binet tests, 197-207 

Statistics, growth of, 22-25 

Stenographic ability, 254 

Stratified-random sample, 291 

Stromberg Dexterity Test, 241-242 

Structure of human abilities, 163-190 

Structured stimuli compared to unstructured 
stimuli, 338-339 

Success ratio, 92 

Suggestibility tests, relation to personal- 
ity, 366 

Superior adults, group tests for, 216-218 

Suppressor tests, 123 


Telebinocular, 238 

Terman-McNemar Test of Mental Abil- 
ity, 216-217 

Test construction, 141-162 

Test length, effect on reliability, 111, 
429-430 

Testing environment, effect on reliability, 
105 


Thematic Apperception Test (TAT), 
339, 343-352, 402 

Thurstone method of attitude scaling, 
302-305 

Thurstone Temperament Schedule, 330 

Trait clusters, 334-335 

Transformations of score distributions, 46, 


Triads, method of, 389-391 

“True” scores estimated from reliability 
coefficient, 98-102 

Twins, resemblance of intelligence test 
scores, 228-229 


Unitest item analysis, 147-149 
Unstructured stimuli compared to structured 
stimuli, 338-339 


Validity, construct, 65-66 
of tests, 57-69 

“Validity scores” on self-description in- 
ventories, 326-327 

Variance diagrams, 115-118 

Variance ratio, 115 

Vector model in factor analysis, 168-172 

Verbal comprehension factor, 174-175 

Verbal factors, 174-175 

Verbal tests compared to performance 
tests, 194-196 

Visual abilities, 237-238 

Visual memory factor, 178 

Vocational guidance, 272, 318-319, 406- 
407 

Vocational Interest Blank (VIB), 316- 
319 


Watson-Glaser Critical Thinking Ap- 
praisal, 279-280 

Wechsler Adult 
Wechsler Intelligence Scale for Children 
(WISC), 212-213 

Wechsler-Bellevue Intelligence Scale, 
207-211 

Wing Standardized Tests of Musical In- 
telligence, 257 

Word association test, 346-347 

Word fluency factor, 175 

Wording of questions, on essay examina- 
tions, 155 

on multiple-choice examinations, 157- 